Advance Praise


"Thrilling and provocative ... There is a need for such a book... There's nothing quite like this out there. An epic tale 
of biology's central molecule, RNA. 

DNA does only one thing, store information. RNA has a thrilling plethora of functions, including telling DNA 
what to do. 

This book takes the reader on an odyssey through the wonders of RNA and its central role in biology. 

DNA science dominated the second half of the 20th Century, but it's clear that the 21st Century belongs to RNA. 
This long-overdue book reveals the diverse wonders of RNA in a series of thrilling and provocative stories." 


Tom Cech, Nobel laureate, University of Colorado Boulder 


"The book is truly monumental and will be treasured by RNA scientists and others, as well. It beautifully captures 
the excitement and wonder that I have been lucky to experience working in the RNA field since the early 1960s." 


Joan Steitz, Yale University 


"This book is really disruptive and presents a coherent view of our understanding of biology in terms of the genetic 
molecules, the nucleic acids, DNA and RNA. It covers an immense territory of molecular biology and its history of 
discoveries, all presented with a clear-cut intellectual thread. 

... It is very timely by its breadth and emphasis on the role of RNA in biology. It makes a strong case for RNA and 
its late acceptance... the fight uphill, like that of Sisyphus, was tough and demanded a lot of perseverance. It is really 
rather complete." 


Eric Westhof, University of Strasbourg 

"The book is unique. It provides the long-overdue correction of the still widespread static views on evolution, devel- 
opment and genome organization and function. It has the potential to induce radical changes in widely held views 
and attitudes." 

Peter Vogt, Scripps Research Institute, La Jolla 

"History is the key to our modern understanding of RNA. This magnum opus describes how science, scientific 
thought and landmark discoveries revealed the central role of RNA in molecular biology and evolution. The authors
are not only modern pioneers of RNA science, but also the best histo-RNA-ians of our time."


John Rinn, University of Colorado Boulder


RNA, the Epicenter of Genetic
Information 


The origin story and emergence of molecular biology are muddled. The early triumphs in bacterial genetics and the
complexity of animal and plant genomes complicate an intricate history. This book documents the many advances, as 
well as the prejudices and founder fallacies. It highlights the premature relegation of RNA to simply an intermediate 
between gene and protein, the underestimation of the amount of information required to program the development 
of multicellular organisms, and the dawning realization that RNA is the cornerstone of cell biology, development, 
brain function and probably evolution itself. Key personalities and their hubris as well as prescient predictions are 
richly illustrated with quotes, archival material, photographs, diagrams and references to bring the people, ideas and 
discoveries to life, from the conceptual cradles of molecular biology to the current revolution in the understanding of 
genetic information. 


Key Features 


* Documents the confused early history of DNA, RNA and proteins — a transformative history of molecular 
biology like no other. 


* Integrates the influences of biochemistry and genetics on the landscape of molecular biology. 


* Chronicles the important discoveries, preconceptions and misconceptions that retarded or misdirected 
progress. 


* Highlights the major pioneers and contributors to molecular biology, with a focus on RNA and non-coding 
DNA. 


* Summarizes the mounting evidence for the central roles of non-protein-coding RNA in cell and develop- 
mental biology. 


* Provides a thought-provoking retrospective and forward-looking perspective for advanced students and 
professional researchers.


Preface


RNA likely underpinned the emergence of life, yet it 
is arguably the least appreciated of all biological mac- 
romolecules. For most of the past century, RNA has 
been regarded principally as an intermediary between 
gene and protein. However, most of the human genome 
expresses RNAs that do not encode proteins, which raises
the question: why?

The understanding of the functions of RNA and the 
answer to this question are bound up with the history of 
molecular biology. The term ‘molecular biology’ was 
coined by the mathematician Warren Weaver in 1938
and has become synonymous with the nature, transmis- 
sion and manifestation of genetic information, and the 
structure of the molecules involved. The field had its 
roots in the discovery of DNA in 1869 and the identifica- 
tion of proteins and their enzymatic activity in the late 
19th century, key events that gave birth to “the science of
the chemistry of life”. Since then, proteins and DNA have
been the primary focus of studies of cellular and developmental
processes and the conceptions of ‘genes’, ‘gene
expression’ and ‘gene regulation’.

While it was clear early on that chromosomes are the 
vehicles of inheritance, and contain DNA, RNA and pro- 
teins, for a long time it was thought that genetic infor- 
mation is held in the proteins; nucleic acids seemed too 
simple. In the 1940s, however, it was shown that DNA 
is the reservoir of genetic information, although it took 
some time for this finding to be accepted. 

The connection between DNA and protein produc- 
tion was solved by the convergence of genetics and bio- 
chemistry, mainly in experimentally amenable bacteria 
and fungi, which led to the breathtaking advances that 
elucidated the role of ‘messenger RNA’ (mRNA) and the
‘genetic code’ in the 1950s and 1960s. The assumption
that genetic information is mostly transacted by proteins 
(‘one gene — one enzyme’), with RNA a transient inter- 
mediate, became entrenched, reflecting the mechanical 
Zeitgeist of the age. 

This assumption led to many subsidiary assumptions 
about the nature of genetic information, and the conclu- 
sion that most of the genomes of plants and animals are 
junk, based on theoretical considerations of mutational 
load and the finding that protein-coding sequences 
occupy only a small fraction of animal and plant 
genomes. The naivety of this conclusion and its superficial
support by intrinsically circular assessments of the
‘neutral evolution’ of ‘non-coding’ sequences in genomes
were rarely challenged. 

There were other assumptions as well, including that 
heritable information is not transmitted from somatic 
cells to reproductive cells. This assertion, supported 
by a peculiar 1888 experiment involving amputation of
mouse tails, accompanied the so-called Modern Synthesis
in the 1930s, which reconciled Mendelian genetics with 
Darwinian evolution and ruled out Lamarckian evolu- 
tion, to buttress the belief that ‘mutations’ are random. 

Undoubtedly the biggest surprises in the history of 
molecular biology were the discoveries in the 1960s and 
1970s that plant and animal genomes are replete with 
‘repetitive elements’ and that their genes are mosaics 
of fragmented protein-coding and flanking regulatory 
mRNA sequences (‘exons’) separated by extensive tracts of 
intervening sequences (‘introns’), which are subsequently 
removed from the primary transcripts by splicing. It was 
immediately and almost universally assumed that introns 
are evolutionary relics colonized by ‘selfish’ genetic hobos 
and that the excised intronic RNA is simply degraded. 

Also unexpected were the discoveries in the 1980s 
that RNA has catalytic capacity and, at the turn of the 
century, that the number of protein-coding genes in 
humans is similar to that in nematode worms that only 
have ~1,000 cells. By contrast, the extent of intronic and 
‘intergenic’ non-protein-coding DNA sequences was 
found to increase with developmental complexity, rising 
to ~98% in humans and other mammals. 

High-throughput expression studies revealed that 
these ‘non-coding’ sequences are transcribed in spa- 
tially restricted patterns to produce hundreds of thou- 
sands of RNAs that do not encode proteins. Many of 
these RNAs were subsequently shown to have regulatory 
and organizational functions during differentiation and 
development. 

Here we provide an account of the development of 
molecular biology from the 19th century to the present. 
We pay particular attention to the history of the under- 
standing of RNA, which has been neglected. We also 
discuss the founder fallacies — where initial interpreta- 
tions of limited data were generalized, became orthodox 
explanations and then articles of faith. Our central theme 
is that the extrapolation of bacterial genetics to complex 
organisms, compounded by expectational, ascertainment
and interpretative biases, has led to a linked series
of false dichotomies and the misunderstanding of the roles
of RNA in the transmission of genetic information. The 
subsidiary theme is the clumsy progress of science. 

This book focuses on RNA as the main player in cell 
and developmental biology, but also on chromatin composition
and regulatory logic. While most of those educated in
the pre-genomic era were taught that gene regulation is 
primarily carried out by proteins, this became hard to 
reconcile with the finding that genes encoding regulatory 
RNAs vastly outnumber protein-coding genes in humans, 
and the demonstrations of widespread sequence-specific 
guidance of effector proteins by RNAs. The simplicity 
and logic of base-pairing for sense-antisense target rec- 
ognition and the ability of RNA to form complex three- 
dimensional structures are almost as old as the double 
helix itself. The existence of regulatory RNAs was hinted at
during the early period of molecular biology by genetic 
observations in fruit flies and maize, and by the appear- 
ance of unexplained bands in biochemical fractionations, 
but these were treated as oddities or interpreted through 
the lens of transcription factors, until the genome projects 
revealed the full extent of RNA expression in plants and 
animals. 

We highlight the pioneers and controversies that 
accompanied the many unexpected observations, with 
particular attention to those that challenged the prevailing
consensus and were often ignored, at least at first. The book
spans the early confusion about the functions of proteins 
and nucleic acids, the elucidation of the double helix and
the ‘genetic code’, the premature relegation of RNA to
intermediary between gene and protein, the strange 
genomes and genetics of plants and animals, and the 
misguided musings that underpinned the idea of junk 
DNA. We chronicle the spectacular advances brought 
by gene cloning and genome sequencing, the small and 
large regulatory RNA revolutions, and the slowly dawn- 
ing realization of the central role of transposon-derived 
sequences, intrinsically disordered proteins, ‘enhancers’ 
and RNA-directed epigenetic processes in multicellular 
development, which we have tried to integrate into a new 
framework for understanding genetic programming. 

We have cited original references where possible, to 
give credit to the work of others and to provide the evi- 
dence for our assertions and conclusions, especially in 
relation to the findings of the last two decades. We have 
also included extensive footnotes that add detail and can 
be skipped, as well as suggestions for further reading. 

While the story is still unfolding, we conclude that the 
genomes of humans and other complex organisms are 
not full of junk but rather are highly compact informa- 
tion suites that are largely devoted to the specification of 
regulatory RNAs. These RNAs drive the trajectories of 
differentiation and development, underpin brain func- 
tion and convey transgenerational memory of experience, 
much of it contrary to long-held conceptions of genetic 
programming and the dogmas of evolutionary theory.


1 Overview


THE GENETIC MATERIAL? 


Proteins have been associated with all biological pro- 
cesses since the inception of biochemistry in the 19th 
century, given their abundance, enzymatic properties 
and versatility. On the other hand, although identified 
in 1869, the functions of nucleic acids remained obscure 
until the 1940s. In the years leading up to the turn of 
the 20th century, DNA was found to be localized in 
chromosomes, which were shown to be the vehicles for 
genetic inheritance. In 1909, it was proposed that nucleic 
acids form simple tetramers containing each of the four 
component nucleotides, following which it was generally 
thought, because of their presumed repetitive structure, 
that nucleic acids have only peripheral functions. 

Accordingly, proteins, which are also found in chro- 
mosomes, were regarded as the repository of genetic 
information for the first four decades of the 20th century, 
with DNA functioning as a scaffold. However, in 1944, 
DNA was demonstrated to be the ‘transforming princi- 
ple’ in bacteria, although this finding was only widely 
accepted after bacteriophage infection ‘pulse-chase’ 
experiments in 1952, the elucidation of the structure of 
DNA in 1953 and the demonstration of its semi-conser- 
vative replication in 1958. 

While having a nucleotide composition similar to 
DNA, RNA did not appear to play a role in the intergen- 
erational transmission of genetic information, although 
RNA viruses were later found to exist. It was regarded for 
decades as an uninteresting metabolic molecule in bacte- 
ria, yeast and plants, and only conclusively shown to exist 
in animal cells in the 1930s. 

Microbial genetics from the 1920s established that 
(some) genes encode proteins, but the mechanism by 
which this occurred was unknown. Gradually it dawned 
that RNA might be involved, inferred from histochemi- 
cal, ultracentrifugation and spectroscopic studies in the 
1940s that showed that RNA is present in cytoplasmic 
microsomes (ribosomes), which were becoming recog- 
nized as the sites of protein synthesis. 

Meanwhile, theoretical biologists had declared, in 
the so-called Modern Synthesis reconciling Darwinian 
evolution with Mendelian genetics, that mutations are 
random, and that Lamarckian inheritance of experi- 
ence does not occur. Moreover, the emphasis on lethal 
protein-coding mutations muddled the interpretation
of genetic variation, with ongoing debates between the 
‘Mendelians’ and the quantitative geneticists. 


HALCYON DAYS 


The abundant ribosomal RNAs (rRNAs) were identi- 
fied in the mid-1950s, but a specific function for RNA 
was demonstrated only in 1958, when small RNAs were 
shown to act as ‘adaptors’ for the incorporation of amino 
acids into microsomal proteins, named ‘transfer RNAs’ 
(tRNAs). In 1961, the radioactive labeling of ‘messenger 
RNAs’ (mRNAs) finally identified the ‘unstable’ inter- 
mediate between genes and proteins, establishing the 
connection. 

In the following decade, the triplet ‘genetic code’ for
protein synthesis was deciphered. Analysis of the lactose 
(lac) operon of Escherichia coli cemented the conclu- 
sion that genes are synonymous with proteins. The regu- 
lation of gene activity by protein ‘transcription factors’ 
was established and assumed to hold not just in bacteria 
but also in developmentally complex organisms. All that 
remained to do, it seemed, was to flesh out the details. 


WORLDS APART 


It was obvious by that time that plants and animals are 
orders of magnitude more complex than bacteria and 
have different cellular and genetic features, including 
much greater internal compartmentalization and far 
larger genomes. It was later shown that eukaryotic cells 
arose by fusion of bacterial and archaeal cells and that 
developmentally complex organisms burst onto the scene 
in spectacular adaptive radiations, most likely following 
regulatory innovations required to orchestrate organized 
cell division and differentiation. 

Studies using newer techniques in the 1960s and 1970s 
showed that eukaryotic DNA is packaged in a repeating
structure (‘nucleosomes’) composed of basic proteins
called histones, and that chromatin is compacted and 
remodeled during development. It was found that histones 
are dynamically modified by methylation and acetylation, 
which suggested that histone modifications act as a regulatory
mechanism. It was also shown that RNA is associated
with chromatin and that very high molecular weight
‘heterogeneous’ RNAs are synthesized in the nucleus,
predicted to be precursors of mRNAs, but the function 
of the remainder of these transcripts was mysterious. 


STRANGE GENOMES, STRANGE GENETICS 


The use of the fruit fly Drosophila melanogaster as a 
model genetic system from the 1910s enabled the map- 
ping of genes along chromosomes by measuring recom- 
bination distances (co-inheritance frequencies), which 
established the view of genes as discrete, ‘particulate’ 
entities. Analysis of naturally occurring and radiation- 
induced mutations identified ‘homeotic’ loci that caused 
bizarre segmental transformations along with other 
encoding epigenetic ‘modifiers’ that exhibited strange 
interactions. 

Odd genetic phenomena were also reported in plants. 
‘Rogue’ non-Mendelian patterns of inheritance were 
observed in peas in 1915 and characterized in other species
from the 1950s, termed ‘paramutation’, later understood
to be a feature of transgenerational epigenetic
inheritance. Mobile ‘controlling elements’ were identi- 
fied in maize in the 1940s and shown to be due to the 
transposition of regulatory cassettes. In the mid-1960s, 
large fractions of the genomes of plants and animals 
were found to be composed of ‘repetitive sequences’,
most of which derive from transposable elements. It was 
also found that the repetitive sequences are differentially 
transcribed. 

In 1969, these disparate molecular observations were 
integrated into a schema of gene regulation in embryonic 
development, which included the concepts of ‘structural’ 
(protein-coding) and ‘integrator’ genes (most likely)
expressing regulatory RNAs recognized by cognate 
receptor sequences, connected into networks by repeti- 
tive sequences. Processed nuclear RNAs were posited in 
other models to be global regulators of gene expression, 
but the problem was the lack of detail about the actual 
information in genomes, which rendered these mod- 
els, as reasonable as they were, speculative and largely 
overlooked. 


THE AGE OF AQUARIUS 


The problem of lack of detail began to be solved by the 
gene cloning revolution and the development of DNA 
sequencing in the 1970s. These technologies led to an explosion
in knowledge, and by the mid-1990s shotgun cloning
and sequencing was being used to characterize the many 
mRNAs that had eluded identification by biochemical 
and genetic assays. A myriad of protein-coding genes 
was discovered in organisms from bacteria to humans,
including those that regulate development, cell division, 
cell differentiation, cell signaling, trafficking pathways 
and immunological responses, among many others, as 
well as mutated versions in cancer. These advances, how- 
ever, diverted attention from the broader questions of 
genome regulation and reinforced the concept of genes 
as protein-coding. 


ALL THAT JUNK 


By the 1970s, it was evident, however, that most sequences 
in the genomes of complex organisms are not protein-cod- 
ing (Figure 1.1). The amount of cellular DNA was found to 
broadly increase with developmental complexity, but there 
were incongruities, termed the C-value enigma. Theoretical 
considerations of population genetics, the lethality of 
protein-coding mutations, the presence of large numbers of 
repetitive sequences and seemingly defective ‘pseudogenes’ 
all suggested that some, and perhaps most, multicellular 
organisms carry substantial loads of non-functional DNA. 

The corollary of ‘neutral’ evolution of non-functional 
sequences was widely accepted, although there was 
debate between the ‘near-neutralists’ and ‘adaptationists’ 
concerning the signatures of protein-coding genes (and, 
later, regulatory sequences) underpinning quantitative 
trait variation. Nonetheless, there was growing consen- 
sus that much if not most of the DNA in plant and ani- 
mal genomes must be junk and that the many repetitive 
sequences are ‘selfish’ genetic hobos. 

The discovery in 1977 that eukaryotic genes are 
mosaics of short fragments of mRNA protein-coding 
and flanking regulatory sequences (‘exons’) interspersed 
with non-coding sequences (‘introns’) that are removed 
by post-transcriptional splicing explained heterogeneous 
nuclear RNA and was proffered as further evidence of 
junk. Introns were rationalized as the remnants of the 
prebiotic assembly of genes, which had been purged from 
microbial genomes under selective pressure for rapid rep- 
lication, even though the ancestors of complex organisms 
were also microbial. On the other hand, while small in 
unicellular eukaryotes, introns were found to increase 
in number and size with the developmental complexity 
of multicellular organisms, which suggested that these 
sequences had acquired important functions. 


THE EXPANDING REPERTOIRE OF RNA 


In parallel with the gene cloning revolution, the increas- 
ing sophistication of biochemical techniques identified 
relatively abundant RNA species beyond the canonical 
trio of tRNA, mRNA and rRNA. These included small
nuclear RNAs (snRNAs) that guide splicing and other 
aspects of gene expression; small nucleolar RNAs (snoR- 
NAs) that guide modifications of rRNAs, tRNAs and 
snRNAs; 7SK RNA, a negative regulator of transcription; 
7SL RNA, an essential component of the ‘signal recog- 
nition particle’ that targets proteins to the endoplasmic 
reticulum and precursor of the ubiquitous Alu elements in 
the human genome; ‘vault’ RNAs of mysterious function 
but known to be involved in recycling of cellular com- 
ponents in lysosomes and neuronal synaptic plasticity; 
and rodent brain-specific transposon-derived RNAs that 
modulate behavior. 

In the 1980s, RNAs were discovered to have self-splicing
and cleavage activities, and RNA was found to catalyze
both translation and splicing, leading to the conclusion
that RNA was the primordial molecule of life. Under this ‘RNA
World’ hypothesis, RNA subsequently outsourced
its enzymatic functions to the more versatile proteins
and its information functions to the more stable and
easily replicable DNA. The early examples of the struc- 
tural and functional capacities of RNA were, however, 
largely interpreted as relic infrastructural components 
rather than another dimension of molecular biology. 


GLIMPSES OF A MODERN RNA WORLD 


In the decades leading up to the turn of the century and 
shortly thereafter, as analytical sensitivity improved, 
many less abundant RNAs were identified. Small anti- 
sense RNAs (‘riboregulators’) and cis-acting RNA struc- 
tures (‘riboswitches’) were found to control transcription 
and translation in bacteria, the latter by allosteric sens- 
ing of metabolites and environmental signals. Synthetic 
antisense oligonucleotides began to be used to artificially 
control gene expression in eukaryotic cells. 

Overlapping ‘antisense’ transcription and ‘nested’ 
genes within genes were observed in animals and plants, 
hinting at intertwined genetic information and regulatory 
complexity. Differentially transcribed long ‘untranslated’ 
RNAs were reported to regulate ribosomal RNA tran- 
scription, and to be produced from the regulatory regions 
of homeotic and heat shock-induced genes in Drosophila
and mammalian immunoglobulin class-switching, can- 
cer-associated and parentally imprinted loci, among 
others. 

Xist was identified as a long non-coding RNA 
that mediates female X-chromosome inactivation 
in mammals, and analogous RNAs mediating male 
X-chromosome activation were identified in Drosophila. 
3' untranslated regions (3'UTRs) in mRNAs were found
to be separately expressed and to transmit genetic information
independently of their normally associated
protein-coding sequences, and small RNAs antisense to 
3'UTRs were found to control developmental timing in 
C. elegans. Although some speculated that these small 
and large RNAs may be the first examples of a more 
extensive RNA regulatory system in cell and develop- 
mental biology, they were generally regarded as oddities. 


GENOME SEQUENCING AND 
TRANSPOSABLE ELEMENTS 


By the mid-1990s, the extraordinary advances in DNA 
cloning, amplification and sequencing had made feasible 
the sequencing of whole genomes. The subsequent expo- 
nential growth of data led to progressively well-annotated 
genome databases and suites of computational tools for 
gene prediction, ortholog identification and the analysis 
of gene structure and expression. For the first time, the 
full complement of DNA sequence information in bacte- 
ria and archaea, protists, fungi, plants and animals began 
to be revealed, enabling comparative genomics to interro- 
gate evolutionary relationships and functional indices at 
increasingly high resolution, including in complex micro- 
bial ecologies. 

Prokaryote genomes were confirmed to be domi- 
nated by protein-coding genes, with phenotypic diversity 
achieved primarily by proteomic variation. On the other 
hand, animals differing by orders of magnitude in devel- 
opmental complexity were unexpectedly found to have a 
similar number and repertoire of protein-coding genes,
only about 20,000 in both nematodes and mammals: the
‘G-value enigma’.

By contrast, increased developmental complexity 
correlated with the extent of non-protein-coding DNA, 
reaching over 98% in humans and other mammals, 
indicating that the developmental sophistication of mul- 
ticellular organisms is achieved by the expansion of regu- 
latory information. Moreover, transposable element and 
retroviral-derived repetitive sequences began to be rec- 
ognized as major drivers of phenotypic innovation in a 
wide range of plants and animals. 


THE HUMAN GENOME 


The first draft of the human genome sequence was pub- 
lished in 2001, notwithstanding the controversies that 
surrounded the project. The number of identified human 
protein-coding genes was far lower than expected by most 
in the field. Comparison with the mouse genome suggested
that ~95% of the human genome is non-functional,
based on the assumption that ancient transposon-derived
sequences can be used to measure the rate of neutral evo- 
lution. On the other hand, analyses of genomic features 
such as transcription, sequence accessibility, DNA and 
histone modifications and transcription factor binding led 
to the conclusion that most of the human genome exhibits 
biochemical indices of function. 

Human ‘Mendelian’ disorders were mapped and confirmed
to be largely due to disabling mutations in protein-coding
sequences. By contrast, genome-wide association
studies showed that variations affecting complex traits 
and disorders reside mainly in non-coding regions of the 
genome, although an appreciable fraction of the known 
genetic contribution to these traits appeared unaccounted for,
suggesting other factors at play. 


SMALL RNAs WITH MIGHTY FUNCTIONS 


Genetic observations in the 1980s and 1990s indicated that 
RNA may play a general role in gene regulation, when it 
was reported that sense and antisense RNAs could modu- 
late endogenous gene expression transcriptionally and 
post-transcriptionally, referred to as ‘co-suppression’, ‘gene 
silencing’ and (ultimately) ‘RNA interference’ (RNAi). 
The finding that introducing sense and antisense RNAs 
together resulted in strong systemic repression of target 
genes led to the dissection of the RNAi pathways, showing 
that double-stranded RNAs are processed to form ‘small 
interfering RNAs’ (siRNAs) that guide DNA methylation 
and cleavage of orthologous sequences in mRNAs. 

At the turn of the 21st century, it was discovered that 
the RNAi pathway is used extensively to control gene 
expression during animal and plant development, via 
‘microRNAs’ (miRNAs) derived from introns and other 
non-protein-coding transcripts. Related small RNAs, 
‘piRNAs’, many produced from repetitive sequences, 
were found to be required for fertility, germ and stem 
cell development in animals. Other classes of small 
regulatory RNAs were found to be derived from tRNAs, 
rRNAs, snoRNAs, snRNAs, gene promoters and splice 
junctions, and small RNAs were shown to have many 
functions, including intergenerational and interspecies 
communication. 

A similar pathway, termed ‘CRISPR’, was later found 
in bacteria to use RNA guides to target cleavage of bac- 
teriophage genomes, manipulation of which has revolu- 
tionized genetic analysis and genetic engineering. The 
common feature of the RNAi and CRISPR pathways is 
that they use small RNAs to guide generic effector pro- 
teins to target cognate sequences in RNAs and DNAs, a 
highly efficient and flexible system of gene control.


LARGE RNAs WITH MANY FUNCTIONS


The high-throughput RNA profiling projects that fol- 
lowed the genome projects revealed the existence in both 
animals and plants of large numbers of low abundance 
long, often multi-exonic, RNAs that have little or no 
protein-coding potential. These ‘long non-coding RNAs’ 
(lncRNAs) were found to be expressed “intergenically”,
intronically and antisense to or overlapping protein-coding 
genes, as well as from thousands of ‘pseudogenes’ and 
3'UTRs. The data also showed that most of the genome of 
eukaryotes is transcribed in highly complex overlapping 
patterns, substantially from both DNA strands. 

Although initially suspected to be noise, lncRNAs
were found to be dynamically expressed during differ- 
entiation and development, mostly in cell-type specific 
patterns. LncRNAs were also found to be associated with
membrane-less cellular organelles, chromatin-modifying 
proteins and/or chromatin domains. While the genetic 
signatures of lncRNAs are, in the main, subtler than those of
protein-coding genes, many have been shown to be involved
in cancer and developmental, autoimmune, neurodegen- 
erative and neuropsychiatric disorders. Large numbers 
of lncRNAs, many of which are clade- or species-specific,
were also discovered to have functions in cell fate
determination and reprogramming, DNA damage repair, 
germ layer specification, hematopoietic, immunological 
and neuronal differentiation, retinal, skeletal, muscle and 
brain development, and memory and behavior, among 
many others. 


THE EPIGENOME 


It became increasingly evident during this period that the 
chromosomes of higher organisms are highly organized 
and epigenetically modified. Cytogenetic and molecu- 
lar studies from the 1980s had shown the existence of 
chromosome territories, gene-rich and gene-poor regions 
and fine-scale ‘topologically associated domains’ with
variable GC contents and non-random distributions of 
sequences derived from transposable elements. New 
genetic loci termed enhancers, hundreds of thousands of 
which exist in mammalian genomes, were identified and 
found to control plant and animal development by selec- 
tive activation of protein-coding genes in their vicinity. 
Nucleosomes were shown to contain canonical and 
specialist histones, some specific to mammalian germ 
and neuronal cells. The histones were found to be sub- 
ject to a bewildering variety of post-translational modi- 
fications that are imposed, interpreted and erased by 
protein complexes that often have no intrinsic sequence 
specificity, including many essential for the developmental 
regulation of gene expression. 

Histone modifications were shown to vary by gene 
expression and differentiation state. Exons were found to 
be preferentially located in nucleosomes, suggesting that 
epigenetic control of gene expression can be exon-spe- 
cific. Vertebrate DNA was also found to be dynamically 
methylated during development, perturbed in cancer, and 
associated with gene repression. Little was known, how- 
ever, of the pathways that determine the locus specific- 
ity of epigenetic modifications during development or in 
response to environmental influences. 


THE PROGRAMMING OF DEVELOPMENT 


The overarching question, rarely considered, is how much 
information is required to program development? The 
nematode worm has ~1,000 leaves in its developmental 
tree that are genetically hard-wired. Similarly, humans 
and other mammals must make trillions of divide or dif- 
ferentiate cell fate decisions with high accuracy, also 
hard-wired, as evidenced by the phenotypic congruency 
of monozygotic twins. 

It had been widely assumed that Boolean combinator- 
ics of transcription and other regulatory factors acting 
on cis-acting regulatory DNA sequences would suffice 
to direct developmental ontogeny, but this proposition 
was not rigorously justified theoretically, mathematically 
or mechanistically. By contrast, a decisional tree with 
N leaves requires an exponentially greater number of 
regulatory decisions, which is consistent with the quasi- 
quadratic increase in the number of regulatory genes with 
total gene number in bacteria. In all organisms, presum- 
ably, the proportion of the genome devoted to regulatory 
information increases with metabolic, developmental or 
cognitive complexity. 

The fact that the genomes of plants and animals are 
transcribed in dynamic patterns during development and 
millions of different epigenetic marks are imposed at dif- 
ferent positions in different cells across developmental 
stages suggests that RNA regulation has been enlisted as 
the most flexible and information-efficient solution to the 
challenge of orchestrating multicellular ontogeny. 


RNA RULES 


Over the past two decades, RNA has been shown to 
regulate chromosome structure through interaction with 
transposon-derived sequences. DNA methylation, often 
differentially imposed at repetitive elements, had been 
known since the 1990s to be RNA-guided. Chromatin 
remodeling proteins, sometimes referred to as 'pioneer 
transcription factors', which have little or no sequence 
specificity and address different loci at different devel- 
opmental stages, bind RNA. RNA-DNA hybrids and 
RNA-DNA-DNA triplexes were found to be common in 
eukaryotic chromatin. Histone-modifying proteins also 
have no intrinsic sequence specificity but some have been 
shown to associate with RNA ‘promiscuously’, i.e., bind 
to many different RNAs. 

The largest class of sequence-specific transcription 
factors, containing zinc finger motifs, also addresses tar- 
get loci differentially and binds RNA as well as DNA, 
with many having higher affinity for RNA-DNA hybrids 
than for double-stranded DNA. Half of the C2H2 zinc 
finger proteins in the human genome contain KRAB 
domains, many primate-specific, which wire them into 
regulatory networks by binding cognate transposon- 
derived sequences. 

Enhancers were found to express non-coding RNAs 
that are required for enhancer action, which involves 
chromatin ‘looping’ to form transcriptional hubs. 
Enhancers have all of the signatures of genes, except 
that they do not encode proteins. The number of mapped 
enhancers is approximately the same as the number of 
lncRNAs expressed from the human genome, which 
resolves the G-value enigma. 

It was also discovered that most proteins involved in 
regulating gene expression in plants and animals, includ- 
ing transcription factors and histone modifiers, contain 
‘intrinsically disordered regions’ (IDRs), the fraction of 
which increases with developmental complexity. IDRs 
interact with RNAs to form phase-separated condensates, 
which are widely deployed to organize subnuclear and 
cytoplasmic domains, including topologically associated 
transcriptional hubs in chromatin. RNA interaction with 
primitive proteins containing IDRs to form phase-sepa- 
rated domains may also comprise the third dimension of 
the ancestral protocell. 

LncRNAs have a modular and highly alternatively 
spliced structure, with many domains derived from 
‘repetitive’ elements. LncRNAs also act as scaffolds and 
guides for ribonucleoprotein complexes, a highly efficient 
and flexible system that, like RNAi and CRISPR, uses 
RNA signals to regulate and direct generic protein effec- 
tors to their sites of action to program development and 
adaptive radiation. 


PLASTICITY 


Over 170 different modifications of nucleotides have 
been identified in RNA, some important for the structure, 
function or stability of rRNAs, tRNAs, snRNAs and 
snoRNAs, as well as mRNAs and other non-coding 
RNAs. These modifications have also been found to be, at 
least in some cases, reversible and to modulate the struc- 
ture-function relationships of RNAs to control processes 
as diverse as chromatin organization, stem cell differ- 
entiation, development, brain function, stress responses, 
mRNA stability and miRNA processing, among others. 
RNA modifications have been used to allow mRNA vac- 
cines to evade the innate immune response. 

RNA is also ‘edited’ by cytosine and adenosine 
deamination, to form uracil and inosine, respectively. 
Adenosine editing has expanded in vertebrate, mamma- 
lian and primate evolution, especially in the brain, and in 
humans occurs largely in Alu elements, which invaded 
the genome in three waves during primate evolution and 
occupy over 10% of the genome, with more than 1 mil- 
lion copies. 

The APOBEC enzymes that deaminate cytosine to 
form thymine or uracil are vertebrate-specific, the first 
involved in somatic rearrangement and hypermutation 
of immunoglobulin domains. The APOBECs have 
expanded under positive selection during mammalian 
and primate evolution, apparently to regulate transpos- 
able element and retroviral activity. Repetitive elements 
are mobilized in the brain, which is being shown to have 
many other unusual molecular dynamics associated with 
its ability to re-wire synaptic connections. 

Transgenerational epigenetic inheritance (such as 
‘paramutation’) was shown to involve small RNA sig- 
nals and DNA methylation. Paramutation is associated 
with simple tandem sequence repeats (STRs), over 1 
million of which are present in the human genome and 
are enriched in promoters of protein-coding genes and 
enhancers. STR variation has been linked with psychi- 
atric disorders and cancer, as well as the modulation of 
physiological and neurological traits, suggesting that the 
extent of soft-wired inheritance of experience has been 
underestimated. 


BEYOND THE JUNGLE OF DOGMAS 


It seems that the nature of genetic information in complex 
organisms has been misunderstood since the inception of 
molecular biology, primarily because of the assumption 
that most genetic information is transacted by proteins. 
Other assumptions made during the formative years of 
genetics also appear to be incorrect, notably that muta- 
tions are random, and that epigenetic memory of experi- 
ence is not inherited. 

A transformation is taking place in the understand- 
ing of the role of RNA in evolution, inheritance, cell and 
developmental biology, brain function and disorders, 
ranging from basic science to a myriad of applications, 
including a new generation of RNA therapies. 

Genomes contain biological software encompassing 
codes for components, self-assembly, differentiation and 
reproduction, supplemented by information transmitted 
by epigenetic memories. Not only has the data evolved, 
but also the data structures, implementation systems, 
evolutionary search algorithms and the interplay between 
hard- and soft-wired inheritance. Indeed, it is likely that 
evolution has learned how to learn, and that many primi- 
tive preconceptions will have to be reevaluated, with 
more surprises in store. 

The details follow. 


2 The Genetic Material? 


For most of human history, the nature of matter, and life, 
was a subject of speculation. Famously, in the 4th cen- 
tury BCE, the Greek philosopher Aristotle cemented the 
ruminations of his predecessor Empedocles to assert that 
all matter — organic and inorganic — was composed of 
four elements — fire, air, water and earth — the ratio of 
which determined its properties. Aristotle also asserted 
that these elements are interchangeable but did not 
accept the deduction of his predecessors, Leucippus and 
Democritus, that all matter could be reduced to indivis- 
ible particles, ‘atomos’ or atoms. Aristotle's views held 
sway for two millennia. 

In 1773, Joseph Priestley showed that heating of mer- 
cury 'calx' (mercuric oxide) not only produced liquid 
mercury, as had been known for thousands of years, but 
also a gas that caused candles to burn brightly. It was also 
known that treating metals with acids produced another, 
highly flammable, gas, termed “phlogiston”. In 1778, 
Antoine Lavoisier mixed these two gases, added a lighted 
match and observed that they combined to form water. He 
named Priestley’s gas ‘oxygen’ (from the Greek, meaning 
acid-maker) and re-named phlogiston ‘hydrogen’ (water-maker). 
He also burned other substances such as phosphorus 
and sulfur and showed that they combined with 
air to make new materials, ‘compounds’, gaining weight 
in the process, then in 1789 published a list of 33 chemical 
elements, grouping them into gases, metals, nonmetals 
and earths.1 

In 1794, Joseph Proust proposed that compounds have 
defined chemical formulas, and in 1803 John Dalton 
proposed the atomic theory of matter,2 leading in the 
following decades to the identification and prediction of 
many other atomic “elements”, the development of simple 
nomenclature (H for hydrogen, O for oxygen, C for car- 
bon, N for nitrogen, S for sulfur, P for phosphorus and 
so on), the distinction by Amedeo Avogadro in 1811 of 
atoms and combinations thereof (‘molecules’),+> and the 
development of the Periodic Table in 1869 independently 
by Dmitri Mendeleev and Julius Meyer. 

Such was the time that chemists started to explore the 
nature of matter in biology. The geneticists and physicists 
came later. 


DOI: 10.1201/9781003109242-2 


SUGARS AND FATS 


Sugars, starches, oils and fats derived from plants and 
animals had of course been known for eons and used for 
nutrition, cooking and other practical applications. 

In 1789, François Poulletier de La Salle and Michel 
Chevreul described a substance that could be extracted 
by alcohol from bile stones and named it ‘cholesterine’ 
(cholesterol) from the Greek 'chole', meaning bile and 
‘stereos’, meaning solid.’ In 1815, Henri Braconnot clas- 
sified fats into ‘suifs’ (greases) and ‘huiles’ (fluid oils). 
In the next decade, Chevreul developed a more detailed 
classification, encompassing greases, tallow, waxes, 
resins, oils and volatile oils, among others? In 1847, 
Theodore Gobley isolated phospholipids from brains and 
egg yolks.! The identification of more complex forms, 
such as glycolipids and sphingolipids, came later as did 
the generic term ‘lipid’, coined in the 1920s by Gabriel 
Bertrand from the Greek ‘lipos’ (fat).!! 

Also in 1789, Lavoisier determined that sugar is com- 
posed of carbon, hydrogen and oxygen and that the fer- 
mentation of sugar by yeast produced ethanol and carbon 
dioxide, long exploited in brewing and baking.'” In 1833, 
Anselme Payen and Jean-François Persoz discovered 
the first enzyme activity (‘diastase’),!5 and in 1839 Payen 
coined the term ‘cellulose’ from the French word ‘cellule’ 
for cell.!+15 

In 1857, Claude Bernard isolated and introduced the 
term ‘glycogen’ for the starch-like substance stored in the 
livers of mammals. The term ‘carbohydrate’ (French 
‘hydrate de carbone’) also originated around this time to 
describe high molecular weight chains of simple sugars 
such as glucose, whose composition could be expressed 
generally as Cn(H2O)n. 


PROTEINS: ‘THE LOCUS OF LIFE’ 


In 1828, Friedrich Wöhler produced urea from ammonium 
cyanate, which showed that inorganic molecules 
can be converted into organic compounds" and demol- 
ished the widely held belief that the latter could only 
be produced through some sort of ‘vitalism’.'® In 1839, 
Gerhardus Mulder described substances (fibrin, egg 
albumin and gelatin) that contain large amounts of C, 
H, O and N, and also S and P, which he considered “the 
most essential substances of the animal kingdom”.??? He 
shared the results with the chemist Jöns Berzelius, who 
suggested that these substances be called ‘proteins’, from 
the Greek πρώτειος (‘primarium’ or first).!?21.22 

Mulder's work also showed that proteins from animals 
and plants (which played a “principal role in their economy”) 
shared a similar, but varied, atomic composition.?? 
Their molecular nature, however, remained nebulous for 
decades.???^ Although proteins initially represented more 
a concept than a defined chemical entity, it became grad- 
ually accepted that, as major constituents of all organ- 
isms, they were central to the processes of life. Indeed, 
the ‘colloidal nature’ of the substances of living tissues, 
at the time popularly described as the ‘protoplasm’,*° 
exhibited many of the properties that were associated 
with proteins, and was in fact explicitly regarded as a 
“proteinaceous substance”.? 

This idea was prevalent among the early proponents of 
Darwinian evolutionary theory. One of the most promi- 
nent, Thomas Huxley, defined the protoplasm in 1868 as 
the “locus of life” and postulated that the physical basis 
of life (including heredity) lay in this universal biological 
substance.20%2 Huxley remarked: “It may be truly said, 
that the acts of all living things are fundamentally one” 
and that “all protoplasm is proteinaceous.”?” 

Similarly, in 1871, Charles Darwin cautiously specu- 
lated about the role of proteins in the origin of life, while 
stressing the common ancestry of species: 


It is often said that all the conditions for the first produc- 
tion of a living organism are now present, which could 
ever have been present. But if (and oh! what a big if!) we 
could conceive in some warm little pond, with all sorts 
of ammonia and phosphoric salts, light, heat, electric- 
ity, &c., present, that a proteine compound [emphasis 
added] was chemically formed ready to undergo still 
more complex changes, at the present day such matter 
would be instantly devoured or absorbed, which would 
not have been the case before living creatures were 
formed.% 


During the second half of the 19th century, the diversity 
of proteins became apparent and the empirical observa- 
tions made were key to framing the concepts of enzymes 
and ‘biological specificity’.%2 During this time, Louis 
Pasteur and others demonstrated the nexus between 
enzymatic activity and ‘life’ in fermentation,!? by then 
a well-established paradigm of physiological chemistry. 


a We now know that RNA can nucleate colloidal domains and does so 
in many contexts (Chapter 16). 

Pasteur also showed, using swan-necked flasks, which 
allowed air exchange but limited microbial contamina- 
tion, that life did not arise spontaneously, and noted that 
organic molecules are chirally left-handed,^ a fundamen- 
tal step forward in biochemistry, underpinning, among 
other things, modern drug development. Also studying 
fermentation, in 1894 Emil Fischer promoted the impor- 
tance of stereochemical rules (popularized as the ‘Lock 
and Key Model’) to explain the interaction of substrate 
with enzyme, highlighting the central role and exquisite 
specificity of the “so-called enzymes” in biological 
processes.?' In 1897, Eduard Buchner demonstrated that 
yeast extracts ferment alcohol from sugar, showing that 
biochemical processes do not necessarily require living 
cells, but are catalyzed by the enzymes formed in cells.?? 

Between 1899 and 1908, Fischer and others, notably 
Ernest Fourneau, Franz Hofmeister and Albrecht Kossel, 
made important advances in the understanding of the 
chemistry of proteins, sugars and nucleic acids, including 
the description of the peptide bond. The latter was a cru- 
cial shift in understanding how animal substances arise, 
from the traditional view that proteins are acquired from 
plants to the realization that they are synthesized from a 
set of constituent parts (amino acids‘) incorporated into 
peptide chains: the ‘peptide theory’, largely promoted by 
the work of Kossel.20.2334,35 

It was not until 1926 that James Sumner first purified 
an enzyme (urease),?6 but proteins were already regarded 
as the molecules that underlie life processes, the ‘Protein 
World’.?" Aleksandr Oparin proposed in 1924 that life 
originated on Earth through gradual chemical evolution 
of carbon-based molecules in a “primordial soup”? with 
John (J. B. S.) Haldane independently advancing a similar 
theory 5 years later.2038-40 Both suggested in the 1920s 
that polypeptides were the initial particles of ‘colloidal 
size’, whereas nucleic acids, which had been discovered 
50 years earlier and were known to be major components 
of chromosomes (see below), were not mentioned. 


b Possibly due to the chiral mutation bias of cosmic radiation.*° 

c Interestingly, some amino acids were named after the plant and animal 
source from which they were initially isolated: the first discovered 
amino acid (in 1806) was asparagine, isolated from asparagus; 
later glutamate from gluten; serine from silk (from the Latin for silk, 
‘sericum’); tyrosine (the crystals in aged cheese) from the Greek for 
cheese ‘tyrós’; valine from the roots of the valerian plant; glycine 
from sugarcane and gelatin (the Greek ‘γλυκύς’, sweet tasting); etc. 

d In 1954, following Avery's demonstration that DNA was the genetic 
material, Haldane speculated that nucleic acids may be more primitive 
than proteins.* Oparin also later agreed that RNA likely preceded 
proteins in the origin of life (Chapter 8). 


The idea that proteins were the primordial molecules was 
reinforced by the famous experiment by Stanley Miller 
and Harold Urey in 1953, which demonstrated the for- 
mation of amino acids? from inorganic molecules in the 
simulated reducing and highly electrified atmospheric 
conditions presumed to exist in the primitive Earth.^»46 

Biochemistry became a recognized scientific disci- 
pline in the 1930s and 1940s with the advent of a better 
understanding of the structure-function relationships 
of proteins, driven by physicists using techniques 
such as X-ray crystallography and isotopic labeling.?? 
These advances coincided with the emergence and the 
coining of the term “molecular biology”, whose focus 
is to understand the molecular basis of genetic mate- 
rial and how it determines cellular and organismal 
phenotypes. 

In this sense, molecular biology is distinct from bio- 
chemistry, although there is considerable overlap and 
many biochemists, but perhaps not molecular geneticists 
or cell and developmental biologists, would assert that the 
terms are synonymous.! In any case both new disciplines 
were centered on the study of proteins, which heavily 
influenced the conceptions of genetic information and the 
mechanisms of heredity and development. 

The period from 1930 to 1970 also saw great prog- 
ress in the characterization of intermediary metabolism, 
the other heart of biochemistry? The achievements 
included specification of the glycolytic (fermentation) 
pathway, the urea/ornithine, citric acid and glyoxylate 
cycles,? the pathways for lipid synthesis and degrada- 
tion, the synthesis of amino acids and complex carbo- 
hydrates, and the nucleoside and pentose phosphate 
pathways for the synthesis of nucleotides and enzymatic 
cofactors, notably by Hans Krebs, Gustav Embden, Otto 
Meyerhof (who also discovered the universal energy 
currency, adenosine triphosphate, ATP), Jakub Karol 
Parnas, Otto Warburg, Horace Barker, David Green, 
Peter Mitchell, Ephraim Racker and Salih Wakil, 
among many others.?050-» 


e It was recently shown that similar conditions (involving hydrogen 
cyanide, hydrogen sulfide as the reductant, ultraviolet light as the 
energy source and copper photoredox and wet-dry cycling) can also 
produce ribonucleosides and lipid precursors.^'-^^ 

f Alan Turing, the breaker of the Nazi Enigma Code and the father 
of modern computing, posited in his influential 1952 paper, ‘The 
Chemical Basis of Morphogenesis’, that “a system of chemical substances, 
called morphogens ... is adequate to account for the main 
phenomena of morphogenesis”, noting that (in contrast) “the function 
of genes is presumed to be purely catalytic”.^* 


NUCLEIC ACIDS AND CHROMOSOMES 


The history of nucleic acids? began with Friedrich 
Miescher's discovery in 1869 of "nuclein". This was a 
substance distinct from proteins, isolated from the nuclei 
of pus cells and characterized by resistance to protease 
digestion, a high content of phosphorus and the absence 
of sulfur.+36 Miescher made this discovery in the 
laboratory of Felix Hoppe-Seyler, who had the experiments 
repeated and 2 years later published Miescher's 
paper together with others describing the isolation of 
nuclein from various sources. Miescher floated the idea 
that nuclein might be the genetic material, presumably 
because it was enriched in sperm, but vacillated on the 
issue.” Nuclein was later found to have an acidic nature 
and was named ‘nucleic acid’ (German ‘Nucleinsäure’) by 
his student Richard Altmann in 1889.565559 

Although Miescher's nuclein was later shown to be 
deoxyribonucleic acid (DNA), the ‘nuclein’ isolated from 
yeast by Hoppe-Seyler was not (mainly) DNA but the first 
description of what would later become known as ribo- 
nucleic acid (RNA).? 

In 1882, Walther Flemming described fibrous struc- 
tures in the nucleus and their separation in mitosis, which 
he visualized by staining and termed 'chromatin' (stain- 
able material).9-9? Along with Edouard Van Beneden, he 
also described centrosomes$?9^^ (see Chapter 15), a term 
introduced by Theodor Boveri in 1888.6566 Later that 
year, Heinrich Waldeyer termed the nuclear fibers ‘chro- 
mosomes' (stainable bodies) (Figure 2.1).62.67 

In the years leading up to the turn of the 20th century, 
Albrecht Kossel showed that nucleic acids are an inherent 
component of chromatin and identified their constituent 
nucleoside bases: the ‘purines’? guanine (G) and adenine 
(A), which have a double ring structure; and the ‘pyrimidines’ 
cytosine (C), thymine (T) and (what would be later 
recognized as its RNA counterpart) uracil (U), which 
have a single ring structure.?456.68,69 


g The first component of nucleic acids described was in fact the RNA 
nucleoside, inosine, by Justus von Liebig in 1847, which he isolated 
from beef broth and named ‘inosinic acid’. The ‘umami’ savory 
(‘brothy’ or ‘meaty’) taste (one of the five basic culinary tastes, the 
others being sweet, salty, sour and bitter) derives from a combination 
of the amino acid L-glutamate and ribonucleotides such as guanosine 
monophosphate and inosine monophosphate.?? 

h The first purines discovered were caffeine and theobromine (found 
in chocolate). Purines, pyrimidines and related metabolites are 
remarkably versatile compounds, used widely in biology, not just as 
nucleic acid constituents but also as energy currency, such as ATP, 
GTP (guanosine triphosphate) and NAD (nicotinamide adenine 
dinucleotide), regulatory/signaling molecules (such as cyclic AMP) 
and protein modifiers (such as ADP-ribose). 

FIGURE 2.1 Drawings of chromosome segregation during mitosis by Walther Flemming." 


CHROMOSOMES AS THE MEDIATORS 
OF GENETIC INHERITANCE 


Gregor Mendel’s principles of inheritance involving binary 
combinations of simple genetic traits were re-discovered 
around 1900 by the botanists Hugo de Vries, Carl Correns, 
and brothers Armin and Erich von Tschermak-Seysenegg,””-? 
and promulgated by William Bateson, who translated Mendel’s 
1865 paper in 1901. Bateson also coined the word ‘genetics’ 
(from the Greek genno, γεννώ; ‘to give birth’) and many other 
genetic terms that are still in use, such as ‘homozygote’ and 
‘heterozygote’. 


Around the turn of the century, Carl Rabl and Boverii 
described chromosome territories, nuclear transplanta- 
tion in sea urchins and abnormal chromosomes in cancer, 
leading them to conclude that developmental differentia- 
tion is a consequence of regulatory structures within the 
hereditary material.76-7? 

Based on these and other observations, William 
Sutton proposed in 1903 that chromosomes “constitute 
the physical basis of the Mendelian law of heredity” and 
that “one of the most characteristic features of chroma- 
tin is a large percentage content of highly complex and 
variable chemical compounds, the nucleo-proteids”,80 
which had been first described by William Halliburton in 
1895.8! Shortly thereafter, Wilhelm Johannsen coined the 
terms ‘gene’, ‘genotype’ and ‘phenotype’, although he did 
not speculate about the nature of the gene.8?-83 

Direct evidence for the Chromosome Theory of 
Inheritance was first provided by the discovery of 'sex 
chromosomes' independently by Nettie Stevens and 
Edmund Wilson in 1905.8485 In 1914, chromosomes were 
conclusively demonstrated to be the entities that carry 
genes! by Calvin Bridges*9*7 working in the laboratory of 
Thomas Hunt Morgan, alongside Alfred Sturtevant and 
Hermann Muller, using the powerful fruit fly (Drosophila 
melanogaster) genetic system* that they had developed.** 

The previously conceptual genes had found a physical 
home, a ‘locus’. Morgan, Bridges, Muller and 
Sturtevant proposed that genes are linearly arranged on 
chromosomes like beads on a string and suggested that 
new combinations arise by ‘crossing-over’ (which was 
observed microscopically) and exchange of genetic mate- 
rial between pairs of chromosomes at meiosis, termed 
‘recombination’.55-?! On the other hand, Muller's studies 
identified many loci (‘allelomorphs’)? that later turned out to 
encode regulatory elements, including transposon inser- 
tions (Chapter 10) (Figure 2.2). 


i Theodor and Marcella Boveri also observed that the cytoplasm 
played a role in hereditary processes and proposed that it was the 
interaction of cytoplasm and chromosomes that determined the 
development of an organism, although this interaction was not sub- 
jected to in-depth genetic analysis for nearly a century.” 

j Initially, an X chromosome-linked recessive white eye mutation that 
affected males. 

k Drosophila provided an ideal experimental system for genetic 
analysis, as it is easily maintained in the laboratory and has a short 
generation time. Its importance as a model for animal development 
throughout the 20th century cannot be overstated, despite the ini- 
tial skepticism of medical researchers about its relevance to human 
biology, which largely evaporated later when gene sequencing 
revealed that the genes controlling development and neural function 
in Drosophila have equivalents in humans. Many genes involved 
in Drosophila development are also involved in cancers (Chapters 6 
and 14). 

Analysis of the physical relationships between genetic 
markers (such as eye color or developmental mutations) 
relied on measuring the frequency of their co-inheritance 
in genetic crosses. Co-segregation of markers occurs in 
50% of the progeny if the markers are on different chromosomes 
(or far apart on the same chromosome), but at 
a higher frequency if the loci are located near each other 
and separated only by occasional recombination events. 

These inheritance patterns gave rise to the concepts 
of gene ‘linkage’! and ‘linkage groups’ (i.e., genes con- 
nected on the same chromosome), and “genetic distance” 
(the frequency of recombination between linked genes, 
measured in “centimorgans”, or cM"). It also led to the 
description of phenotypic differences due to ‘cis’ interac- 
tions between genes located nearby on one chromosome; 
“trans” interactions between genes located distally or on 
different chromosomes; homozygous and heterozygous 
variants, where the same or different variants (‘alleles’) 
are present on the parental chromosomes; ‘recessive’ 
mutations, where one copy of the ‘wildtype’ allele is sufficient 
for function and masks the presence of the damaged 
variant; ‘dominant’ mutations, where the mutant 
allele overrides the ‘normal’ gene; and partial ‘pene- 
trance’ where the frequency of a phenotype differs from 
normal Mendelian ratios, due to ‘haploinsufficiency’ or 
the influence of ‘modifier’ genes. 
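
As a rough illustration of the arithmetic behind such maps, the sketch below computes a recombination frequency from progeny counts in a two-marker cross and converts it to a map distance using the convention that 1 cM corresponds to 1% recombination. The function names and the progeny counts are hypothetical, chosen only for illustration, and the direct percent-to-cM conversion is an approximation that is reasonable only for closely linked markers. The 50% threshold is likewise a simplification; a real analysis would apply a statistical test to the observed counts.

```python
def recombination_frequency(parental: int, recombinant: int) -> float:
    """Fraction of progeny that carry a recombinant marker combination."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny counted")
    return recombinant / total


def map_distance_cm(parental: int, recombinant: int) -> float:
    """Approximate genetic distance in centimorgans (1 cM = 1% recombination).

    Assumption: the simple proportional conversion is used, which ignores
    multiple crossovers and so underestimates larger distances.
    """
    return 100.0 * recombination_frequency(parental, recombinant)


def looks_linked(parental: int, recombinant: int) -> bool:
    """Crude check: recombination clearly below 50% suggests linkage."""
    return recombination_frequency(parental, recombinant) < 0.5


if __name__ == "__main__":
    # Hypothetical cross: 830 parental-type and 170 recombinant-type progeny.
    print(round(map_distance_cm(830, 170), 2), "cM")
    print("linked:", looks_linked(830, 170))
```

With these made-up counts the two markers recombine in 17% of progeny, so they would be placed about 17 cM apart, whereas markers on different chromosomes would be expected to recombine in about half of the progeny.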

The genetic maps resulting from such crosses led to 
the conclusion that genes are discrete objects with exclu- 
sive borders — a perception that reflects the low resolution 
of these early studies (and many others to this day). It also 
led to the idea of the ‘gene for ...’, not just physical traits 
like eye color or, later, human genetic disorders such as 
cystic fibrosis (Chapter 11), but ultimately also for psycho- 
social traits, as if all genetic influences could be viewed 
in binary terms (genes -> traits), which underpinned the 
subsequent ‘one gene — one enzyme (protein) — one func- 
tion’ assumption in biochemistry.?^95 

The view of genes as particulate entities was rein- 
forced by the work of Nikolay Timoféeff-Ressovsky, 
who attempted to measure the “radius” of genes,% which 
heavily influenced Max Delbrück, who co-authored their 
‘Classical Green Pamphlet’ in 1935,” which was “the 
starting point” for Erwin Schrödinger's subsequent ruminations 
on the physical nature of genetic information (see 
below) and the “keystone in the formation of molecular 
genetics”.% 


l Linkage between genes (‘partial coupling’) was first observed by 
Bateson and Reginald Punnett in sweet peas in 1904 (see%). 

m A unit corresponding to a recombination frequency between linked 
loci of 1% per generation. 

FIGURE 2.2 Morgan's diagram of Drosophila melanogaster chromosome linkage groups and genetic maps determined by 
recombination frequency, which “cannot be represented in space in any other way than by a series of points arranged in a line like 
beads on a string”. Figure 28 in his 1922 Croonian Lecture on the mechanism of heredity.” (Reprinted with permission from The 
Royal Society.) 


However, Muller had shown by X-ray mutagenesis in 
1930 that rearrangements that moved active genes near to 
heterochromatic regions (Chapter 4) of Drosophila chromosomes 
resulted in changes in the pattern of expression 
of these genes,” called ‘position effect variegation’ 
(PEV).?? This was also observed by Barbara McClintock 
and others in maize, in that case with transposition underlying 
PEV (Chapter 5). 

These observations led Muller and Richard Goldschmidt to challenge the conception of genes as distinct entities.!0?-10? Nonetheless, the view of the gene as a discrete unit persisted even in the face of later discoveries that many genes (i.e., DNA segments that express RNA products) overlap or reside within other genes, ?-!!0 which has led to difficulties in genome annotation and posed a barrier to understanding the genome as an information continuum (Chapters 13-16).

Goldschmidt also coined the term ‘phenocopy’ to describe morphological changes in Drosophila that could be induced by imposing stress during development, insisting on this basis, to little avail, that gene function had to be considered in developmental context111 (Chapter 5). Goldschmidt was considered a heretic by the neo-Darwinists.!%2112


THE MODERN SYNTHESIS 


During the first four decades of the 20th century, evolutionary biologists and geneticists, notably Bateson, Haldane, R. A. (Ronald) Fisher, Sewall Wright, Ernst Mayr, G. Ledyard Stebbins, Theodosius Dobzhansky" and Julian Huxley, among other theoreticians of the time (collectively referred to as the ‘neo-Darwinists’), brought together Mendelian inheritance, Darwinian gradualism and selection, and statistical genetics into what is known as the ‘Modern Synthesis’, after the title of Huxley's 1942 book.!? These pioneers of theoretical population genetics introduced important concepts such as the relevance of population size, genetic drift and the strength of selection, in parallel developing statistical methods and models that have found widespread applications to this day.116,117

A tenet was that all evolutionary phenomena and species diversity can be explained in a way consistent with known genetic mechanisms,” with the unifying theme that natural and artificial selection operates on heritable variation arising by random mutation.? Biometric population approaches (which showed continuous trait variation) and Mendelian inheritance were reconciled by Fisher and others by invoking polymorphic multifactorial traits and the ‘infinitesimal model’ of variation and selection.?,115,122-124

Dobzhansky also worked with Morgan. He showed that natural populations had much greater genetic variability than assumed in previous models!? and coined the famous aphorism “Nothing in biology makes sense except in the light of evolution”.114

Some, notably Cyril Darlington, recognized the limitations of the classical gene and argued that the conception of genotypes as the sum of genes could not explain variation in animal and plant populations, which must be dependent on interactions among genes and between the genotype, the cellular machinery, the reproductive habit and environment of the organism.118,119

In the 1920s, Haldane analyzed the famous textbook example of the appearance of dark pigmentation in peppered moths during the Industrial Revolution and established that evolution could occur even faster than contemporaries such as Fisher had assumed.!% It was much later shown that this occurred by a transposon insertion that altered gene expression?! (see Chapters 5 and 10).

A corollary of the supposition that mutation occurs randomly is that experience cannot influence the characteristics of the next generation, contrary to the proposal of Jean-Baptiste Lamarck a century before Darwin.‘ As evidence against ‘Lamarckian evolution’, evolutionary biologists recalled the 19th-century work of August Weismann,'” who asserted that somatic cells receive only a subset of the information contained in the germline, and that information does not flow in reverse from somatic to germ cells and (therefore) cannot be transmitted to the next generation." The latter came to be known as the ‘Weismann Barrier’,128 but was challenged in the 1940s and 1950s by Conrad Waddington (the father of epigenetics, Chapter 14), who provided evidence of the inheritance (‘genetic assimilation’) of characteristics acquired in response to environmental perturbation'??-P?! (Chapter 5).

The ideas introduced by the Modern Synthesis had a 
lasting influence on the conceptual landscape of molecu- 
lar biology and the interpretations of experimental obser- 
vations. These included the later gene-centric models of 
genome variation, ‘fitness’ and evolution, and the invoca- 
tion of ‘junk’ DNA (Chapter 7), as well as the concept 
that the sequences of important genes are ‘conserved’ 
during evolution, which has frequently been assumed 
in efforts to discriminate functional and non-functional 
regions of genomes (Chapter 11). 

There was considerable cross-fertilization in the 1940s and 1950s between theoretical evolutionary biologists and microbial geneticists, notably Delbrück and Joshua and Esther Lederberg, assessing lethal/competence phenotypes in bacteria and their viruses, which would lend support to the spontaneous and random nature of mutations as opposed to ‘pre-adaptations’ or ‘directed mutations’,!*4 a matter of debate at the time.

q Darwin himself was agnostic about the origin of the variations upon which he posited selection to act and was not antagonistic to Lamarck's ideas. In Chapter V of The Origin of Species, he remarked (making an interesting distinction between artificial selection and natural evolution): “I have hitherto sometimes spoken as if the variations—so common and multiform in organic beings under domestication, and to a lesser degree in those in a state of nature—had been due to chance. This, of course, is a wholly incorrect expression, but it serves to acknowledge plainly our ignorance of the cause of particular variation.”?5 As pointed out by Devon Fitzgerald and Susan Rosenberg: “He also described multiple instances in which the degree and types of observable variation change in response to environmental exposures ... Darwinian evolution, however, requires only two things: heritable variation (usually genetic changes) and selection imposed by the environment. Any of many possible modes of mutation—purely ‘chance’ or highly biased, regulated mechanisms—are compatible with evolution by variation and selection.”126

Part of the evidence was Weismann's peculiar experiment in 1868 wherein he severed the tails of mice for five generations and showed that this experience had no effect on the presence or length of tails in the descendants.'??

Equally if not more importantly for the understand- 
ing of genetic programming in the following decades, the 
emphasis on lethal loss-of-function mutations overlooked 
the differences between essential genes and variations in 
regulatory (non-protein-coding) sequences that may have 
no effect on the viability of plants and animals but are 
the major drivers of quantitative trait variation, adaptive 
evolution, survival and reproductive success (Chapters 7 
and 11). 

These assumptions (that inheritance occurs entirely through Mendelian genes and that mutations occur randomly) became entrenched before virtually anything was known about the nature of genetic variations and their phenotypic impact, especially in relation to the ‘complex traits’ that have large environmental components, and resulted, in part, in the lack of appreciation of Barbara McClintock's later work on mobile genetic elements (Chapter 5). They also underpinned the dichotomy of the classical and historical ‘Nature versus Nurture’ debates that raged for decades, whose participants did not realize or even consider that such complex phenotypes may be the integrated outcomes of overlapping genetic and epigenetic processes.

The concept of genetic determinism overflowed into sociological and political arenas, notably the common and at the time fashionable idea" (extending from observations of selective breeding in agriculture and of domesticated animals by Darwin and others) that there are superior and inferior human characteristics (especially intelligence) that vary between individuals and “races”!% and might be improved by selective breeding (promoted by Francis Galton, who coined the term “eugenics”) or, worse, by sterilization, as occurred in the USA, and by genocide, as occurred in Nazi Germany. Genetics was also politicized in Stalinist Russia by Trofim Lysenko, who favored Lamarckian evolution in line with communist ideology, which led to the abolishment of population genetics and the persecution of geneticists and other scientists in the USSR.138,139

Delbrück insisted that the primary problem in biology was to discover the physical structure of genes.!? At one point, he estimated that the size of a gene was 1,000 atoms.133

An expression also coined by Galton, based on studies of twins,!%5 a common approach used to this day to assess the relative contributions of inheritance and environment to complex traits.

Although scientific racism and eugenics were fashionable in many societies and intellectual circles, important thinkers and scientists opposed those ideas, including Dobzhansky and Alfred Russel Wallace, the co-discoverer of evolution by natural selection, who stated that Galton's eugenics was ‘impractical, ineffective or immoral’.136


DISTINGUISHING DNA AND RNA 


During the same period, research on nucleic acids was sur- 
prisingly limited, despite the finding that they are a major 
component of chromosomes, the repositories of genetic 
information.!* Indeed, nucleic acids barely feature in the 
history of biochemistry before 1940,% a consequence of 
the prevailing view that proteins have greater chemical 
diversity than the assumed monotony of nucleic acids, 
thought to be merely structural or metabolic entities. 

This view was strengthened by the ‘tetranucleotide hypothesis’ put forward in 1909 by the chemist Phoebus Levene, who identified ribose sugars in ‘yeast nucleic acid’ 4^-!? and deoxyribose sugars in calf thymus gland.!%-1% Levene proposed that phosphate-sugar-base units formed circular tetramers containing each of the four ‘nucleotides’, based on the perception that they are present in equimolar proportions.5%12,145-147

Levene’s work in identifying the five common 
nucleotides (A, G, C, T/U) and the phosphodiester 
bonds between them was a major advance,!%1% but the 
tetranucleotide hypothesis implied that nucleic acids 
could not carry complex information. However, dur- 
ing the 1920s and 1930s, Einar Hammarsten showed 
that the molecule later recognized as DNA has a very 
high molecular weight, using a gentler method for its 
isolation that avoided harsh treatment with alkali." 
Robert Feulgen (who discovered the eponymous DNA 
stain, see below), as well as Hammarsten and his stu- 
dent Erik Jorpes, then demonstrated that nucleic acid 
(RNA) from pancreas contained more guanine than 
other bases.5+1% 

These observations did not fit with a simple equimolar tetranucleotide structure (although Jorpes tried to rationalize the guanine excess within a pentanucleotide structure!*), but their significance was not recognized and Levene's hypothesis held sway for decades.1%0:14%5147 In addition, there was a perception that the different types of nucleic acids do not occur in all organisms, which argued against universal biological roles. One form, the ‘thymonucleic acid’ or ‘zoonucleic acid’, had been isolated from animal glands, pus cell nuclei and sperm and was initially thought to be present exclusively in animals; the other, known as ‘pentose nucleic acid’, ‘zymonucleic acid’, ‘yeast nucleic acid’ or (plant) ‘phytonucleic acid’, was thought to be absent from animal cells until the early 1940s.54150,151 Thus, there was confusion about the distribution and functions of what later became known as DNA and RNA.

It was also shown during the same period that DNA and RNA differ in their base composition (DNA contains adenine, guanine, cytosine and thymine, whereas RNA contains adenine, guanine, cytosine and uracil, the demethylated analog of thymine) and in the sugars in their sugar-phosphate backbone, deoxyribose in DNA and ribose in RNA, hence ‘deoxyribonucleic acid’ and ‘ribonucleic acid’.?

Erwin Chargaff, who later observed the purine-pyrim- 
idine equivalences that underpinned the base pairing 
critical to the elucidation of the structure of DNA and the 
templating of messenger RNA, pointed out in 1950 that 


Although only two nucleic acids, the deoxyri- 
bose nucleic acid of calf thymus and the ribose 
nucleic acid of yeast, had been examined analyti- 
cally in some detail, all conclusions derived from 
the study of these substances were immediately 
extended to the entire realm of nature; a jump of 
a boldness that should astound a circus acrobat.!5 


In 1924, Feulgen and Heinrich Rossenbeck developed 
a sensitive histochemical reaction that intensely stained 
DNA.'^ It demonstrated that DNA is present in the 
nucleus, but not the cytoplasm," of plants and animals, 
and constituted important evidence for DNA as the pos- 
sible genetic material.'5 Jean Brachet, perhaps the most 
important of the early RNA biochemists, developed new 
methods to differentiate DNA and RNA in the 1930s 
and early 1940s, using basic dyes and cytochemical 
staining, later combined with ribonuclease treatments 
(Figure 2.3).56-16? His work showed that both DNA and 
RNA are universal constituents of animal and plant cells 
and that RNA, unlike DNA, is mainly located in the 
cytoplasm. Later, it was shown in bacteria that, while the 
DNA composition varies between different species, both 
DNA and RNA are always present.!6^.162 

Although DNA was linked with chromosomes, the function of RNA was for a long time a mystery. It was speculated that RNA might serve as an energy repository or a cytoplasmic precursor converted into DNA during cell division (the “conversion hypothesis”). It was also unclear what form RNA might take, since the 2’ hydroxyl group of ribose represented a potential bifurcation point, such that RNAs could exist as branched molecules. Only later, in the 1950s, was this idea dispensed with, when studies by Alexander Todd and others demonstrated that RNAs are linear polymers" whose nucleotides, as in DNA, are linked by 3'-5' phosphodiester bonds.65.166

Confirmation of the role of RNA as an information molecule emerged from the study of plant and animal viruses that contained RNA but not DNA. In 1937, Frederick Bawden and Norman Pirie identified RNA in tobacco mosaic virus (TMV),!*' despite the previous suggestion by Wendell Stanley, who had crystallized TMV, that protein was the active ‘auto-catalytic’ agent of the virus.!% Later experiments revealed that TMV and other plant viruses, as well as foot-and-mouth disease virus and other animal viruses, contained RNA as the sole nucleic acid, which indicated that RNA must act as the template for viral replication and protein synthesis.!6?.170

v Leaving aside the small amount of DNA later shown to be present in mitochondria and chloroplasts.


ONE GENE-ONE PROTEIN AND THE “NATURE OF MUTATIONS”


The association between 'genes' and proteins dates back 
to 1902, when Archibald Garrod provided evidence for 
the inheritance of human disorders and phenotypes that 
reflected enzymatic deficiencies, such as alkaptonuria."! 
The causes of the enzyme deficiencies in such ‘inborn 
errors of metabolism’ (which included other congeni- 
tal disorders, such as albinism) were still only vaguely 
defined,!”? as they preceded the conceptual terms ‘gene’
and ‘genome’.*

Indeed, even George Beadle and Edward Tatum's influential 1941 ‘one gene – one enzyme’ principle, based on studies of mutants of the filamentous fungus Neurospora crassa (red bread mold) (Figure 2.4), was not well accepted before the nature of hereditary material was understood."* The phrase subsequently morphed into ‘one gene – one protein’ when it was realized that not all proteins are enzymes, and even this changed when it was found much later that, in the higher eukaryotes, one gene can produce variations of the same protein by alternative splicing (Chapter 7).

w On the other hand, the 2' hydroxyl of RNA increases its ability to form three-dimensional structures through hydrogen bonding.!%:!6+

The word “Genom” was coined before the nature of the genetic material was known, by the botanist Hans Winkler in 1920: “I propose the expression Genom for the haploid chromosome set, which, together with the pertinent protoplasm, specifies the material foundations of the species ...” (in his book Verbreitung und Ursache der Parthenogenesis; see 173).

The concept had earlier roots. Lucien Cuénot showed at the turn of the 20th century by his studies of mouse coat color variation that Mendelian inheritance occurred in animals as well as plants. He concluded that “mnemons” (genes) are responsible for the production of enzymes and is credited with the first enunciation of the gene – enzyme concept. ^-"5* Among other important findings, he also described the first alleles and lethal mutation in mice, the recessive yellow agouti allele. This same locus was used a century later by Emma Whitelaw and David Martin to show the epigenetic inheritance of metastable alleles determined by transposable elements!7%180 (Chapters 5, 14 and 17).

Fig. 29. Formation of atypical ectoderm in an amphibian egg treated with ribonuclease at the morula stage (Brachet and Ledoux, 1955).

Fig. 30. Formation of a nervous system in an amphibian egg treated with a mixture of RNA and ribonuclease at the morula stage (Brachet and Ledoux, 1955).

FIGURE 2.3 Jean Brachet's photographs of the perturbation of amphibian morphogenesis by RNase treatment. (Reproduced from Brachet,'® with permission from Elsevier under Creative Commons 4.0 license.)

Beadle and Tatum's work nevertheless united genetics and biochemistry.?^!*! It also popularized the use of microbes as model organisms.!% In 1946, Tatum and one of his students, Joshua Lederberg (with whom he and Beadle later shared the Nobel Prize),? demonstrated genetic recombination in the enteric bacterium Escherichia coli,'$* which consequently became widely used as a model organism. Moreover, most genetic studies were focused on lethal, conditionally lethal (such as in auxotrophy? or temperature-sensitivity) and/or phenotypically severe mutations, especially in haploid cells (bacteria and haploid Neurospora and yeast), which are overwhelmingly biased to protein-coding sequences (Chapter 7).

At the time, proteins were thought to comprise genes, rather than be the products of them. DNA was considered to have only peripheral functions, such as serving as an “intra-nuclear buffer”,'% or acting as a scaffold during gene replication. The latter was championed by the small community of scientists then engaged in nucleic acids research, such as Hammarsten and Torbjörn Caspersson.'* Caspersson, whose work was also instrumental in establishing the involvement of RNA in protein synthesis, proposed a metabolic relationship between nucleic acids and “gene reproduction”, noting that “synthesis of nucleic acid is closely connected with gene reproduction”, and that “it may be that the property of a protein which allows it to reproduce itself is its ability to synthesize nucleic acid”.? However, his conception was that the “structure-forming properties” of DNA (alluding to the high molecular weight DNA polymers earlier demonstrated by Hammarsten) were simply auxiliary to the basic proteins of the nucleus.!+0185

FIGURE 2.4 George Beadle's ‘lantern slide’ explaining the procedure he and Edward Tatum used to isolate metabolic mutants of the filamentous fungus N. crassa, showing complementation and Mendelian segregation of the affected gene. (Reproduced from Horowitz?! with permission from Oxford University Press.)

z Apparently overlooking the contributions of his wife, Esther.'**

aa The inability of an organism to synthesize a particular organic compound required for its growth in minimal media.

Research up until World War II, therefore, concentrated on proteins as the prime candidate for the genetic material, the “protein version of the central dogma”.** Even the first reported infectious agents of bacteria, characterized as “filter-passing viruses” (later termed bacteriophages or ‘phages’ for short, soon to play a major role in the understanding of genes, Chapter 3), were considered to be “enzymes with power of growth”.186


DNA IS THE GENETIC MATERIAL 


In 1944, the physicist Erwin Schrödinger wrote a book entitled What Is Life?, in which he made the logical deduction that the genetic material would consist of an “aperiodic crystal” (that is, a molecule of regular structure with information embedded in its fine-scale variations), a “miniature code”, the first use of the term ‘code’ in relation to biology.

Schrödinger's prediction was borne out in the same year? by Oswald Avery, Colin MacLeod and Maclyn McCarty, who built on the studies of bacterial ‘transformation’ by Fred Griffith in the late 1920s? and showed that the change from a benign to a virulent form of the bacterium Streptococcus pneumoniae could be effected by DNA but not by protein.5*1% Their finding was confirmed in E. coli a year later by André Boivin.?!

However, such was the entrenchment of prior expectations that it took almost 10 years for the conclusion to be widely accepted,?? and “even Avery himself was reluctant to accept it until he had buttressed his experiments with the most rigorous controls”.^^ Arguing that a small amount of contaminating proteins could be present in Avery's preparations, not only Hammarsten, but also other eminent scientists, especially Alfred Mirsky, were unconvinced that DNA was the genetic material.1%014%19%3 James Watson said, in retrospect, that (unfortunately) “most people didn't take him [Avery] seriously”.?^ Moreover, as Gunther Stent observed later,?? Avery's finding was “premature”. General acceptance only came after the 1952 experiments by Alfred Hershey and Martha Chase that showed the uptake of ³²P-labeled DNA, but not ³⁵S-labeled proteins, into bacteria after infection with bacteriophage T2,/99.95 and the elucidation of the structure of DNA in 1953.

bb And later by the structure of DNA, with Schrödinger's prediction explicitly recognized by Francis Crick (see 188).

Levene's tetranucleotide structure was finally and convincingly refuted by Chargaff's demonstrations in the late 1940s that DNA “formed extremely viscous solutions in water” (confirming Hammarsten's earlier observations), implying a structure much larger than a tetranucleotide, and by paper chromatography“ that its constituent bases were not present in equimolar proportions.' Instead, Chargaff showed that the pyrimidine (T, C) and purine bases (A, G) are present in equal amounts in DNA, and that A and T, as well as G and C, occurred in the same proportions.** That is, %G=%C and %A=%T, and the ratios of these pairs of nucleotides (%G+C/%A+T) were the same in different tissues of the same organism, but varied between organisms.207-209
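
The parity relations described above can be checked numerically. The following is a minimal sketch, not a reconstruction of Chargaff's analysis: the function names and the 12-base example strand are hypothetical, and the duplex is modeled simply as one strand plus its base-paired complement.

```python
from collections import Counter

# Watson-Crick partners: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_percentages(seq: str) -> dict[str, float]:
    """Return the percentage of A, C, G and T among the bases of seq."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def duplex_percentages(strand: str) -> dict[str, float]:
    """Base composition of a duplex built from one strand and its complement.

    Strand orientation does not affect composition, so the complement
    is not reversed here.
    """
    return base_percentages(strand + strand.upper().translate(COMPLEMENT))


def gc_to_at_ratio(pct: dict[str, float]) -> float:
    """The (%G+%C)/(%A+%T) ratio, which differs between organisms."""
    return (pct["G"] + pct["C"]) / (pct["A"] + pct["T"])


strand = "ATGCGGCTAAGC"  # hypothetical example sequence
duplex = duplex_percentages(strand)
print(duplex)
print(gc_to_at_ratio(duplex))
```

In this toy duplex %A equals %T and %G equals %C by construction, whereas the single strand on its own need not obey either equality; the (%G+%C)/(%A+%T) ratio is a free parameter of the sequence.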


THE DOUBLE HELIX – ICON OF THE COMING AGE


Chargaff's data were crucial for Watson and Crick's interpretation of the X-ray diffraction patterns of DNA fibers obtained by Rosalind Franklin" and Raymond Gosling, building on work by others, notably Bill Astbury and Florence Bell, who more than a decade earlier had determined the planar structure of, and the ~3.4 angstrom spacing between, the bases (stacked like “a pile of plates”),21%213 which led to the elucidation of its double-helical structure.21%216

cc Stent said “A discovery is premature if its implications cannot be connected by a series of simple logical steps to canonical, or generally accepted, knowledge.”!°?

dd Modified bases, first reported in 1925,95 were also being detected in DNA and RNA by paper and ion-exchange chromatography,?^/9* although the import of these modifications was not appreciated until much later (Chapters 14 and 17).

ee Chargaff later showed that %G=%C and %A=%T also hold in single-stranded bacterial DNA.!” Chargaff's Second Parity Rule has since been shown to hold for all double-stranded genomes, except mitochondrial DNA, over a scale of kilobases (E. coli) and megabases (human), due to abundant inverse symmetries thought to reflect the distribution of repetitive elements,20%2% but which may instead reflect the preservation of RNA secondary structure in the encoded transcripts (including in repetitive elements) (Chapter 16).

ff Franklin was initially skeptical that DNA had a helical structure, as evidenced by her comments in a satirical note, in the style of an in-memoriam card, sent to Maurice Wilkins in July 1952: “It is with great regret that we have to announce the death, on Friday 18th July 1952 of DNA helix ... A memorial service will be held next Monday or Tuesday”. The card is reproduced in Brenda Maddox's book ‘Rosalind Franklin: The Dark Lady of DNA’, p. 185.210

FIGURE 2.5 (a) Photograph 51 of B-DNA. X-ray diffraction photograph of a DNA fiber at high humidity (Franklin and Gosling, 1953).?5 Interpretation of the helical-X and layer lines added in blue. (b) Watson-Crick model of B-DNA. (Adapted from Watson and Crick,?'* with the helical repeat associated with the layer lines labeled. Reproduced from Shing Ho and Carter? with permission from the authors under Creative Commons 4.0 license.)

As is well known, this structure is governed by nucleotide base (purine-pyrimidine) pairing rules, which immediately suggested a mechanism for the duplication of genetic information.?^ This was enormously compelling at the time,?" although it took a few years for its significance to be widely appreciated.?!* The mechanism was subsequently demonstrated in 1958 by Matthew Meselson and Franklin Stahl using labeling with heavy isotopes of nitrogen to distinguish the template from newly synthesized DNA strands in buoyant density gradients.?? An enzyme mediating the synthesis of new, complementary strands was discovered by Arthur Kornberg and colleagues in 1956 and termed DNA-dependent DNA polymerase.220-222
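
The logic of the density-labeling experiment can be made concrete with a short calculation. This is an illustrative sketch under stated assumptions, not the published analysis: it assumes strictly semiconservative replication, a population that starts fully labeled with heavy nitrogen and is then grown in light medium, and synchronous doubling. The function name is hypothetical.

```python
from fractions import Fraction


def duplex_fractions(generations: int) -> dict[str, Fraction]:
    """Expected fractions of heavy, hybrid and light duplexes.

    Assumes every duplex starts with two heavy strands and that each
    replication round pairs every old strand with a new light strand.
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    if generations == 0:
        return {"heavy": Fraction(1), "hybrid": Fraction(0), "light": Fraction(0)}
    # Each original duplex yields 2**n descendants, exactly two of which
    # still carry one of the original heavy strands.
    hybrid = Fraction(2, 2 ** generations)
    return {"heavy": Fraction(0), "hybrid": hybrid, "light": 1 - hybrid}


for n in range(4):
    print(n, duplex_fractions(n))
```

Under these assumptions the first generation is entirely of intermediate (hybrid) density, and the hybrid band then halves in relative abundance with each further generation while the light band grows.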

Not so well known is that John Masson Gulland, Denis Jordan and colleagues had shown in 1947 that DNA is held together by hydrogen bonds,?? and that their PhD student James Creeth had, in his 1948 PhD thesis, proposed a model for the structure of DNA comprising two chains with a sugar-phosphate backbone on the exterior and hydrogen bonds between the nucleotide bases of opposite chains in the interior (Figure 2.5).224,225

gg The considerable back story of the application of X-ray crystallography to understanding the structure of DNA (and proteins) is laid out in Gareth Williams’ book ‘Unravelling the Double Helix: The Lost Heroes of DNA’.?!

Acceptance of DNA as the genetic material and of the significance of its double-helical structure in genetic inheritance and gene expression came only after concerns about the ability of its strands to unwind had been resolved by the Meselson and Stahl experiment, and a plausible mechanism for the expression of genetic information (i.e., RNA-templated protein synthesis) had been established?!522 (Chapter 3).

A subtle but important feature of the structure of 
DNA"! is that its strands are antiparallel, with both strands 
going from 5’ to 3’ (with respect to the phosphate linkages 
that connect the sugar-phosphate backbone) but in the 
opposite direction to each other, with DNA replication, 
RNA transcription and protein synthesis all proceeding 
from 5’ to 3’ in relation to the sugar-phosphate linkages. 
The linear arrangement of the bases along the backbone 
of the helix was fundamental to the logical deductions and 
experimental approaches to deciphering the genetic code. 
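
The antiparallel arrangement can be illustrated with a few lines of code. This is a minimal sketch with hypothetical function names and a made-up six-base sequence: because both strands are conventionally written 5' to 3', the partner of a strand is its complement read in reverse.

```python
# Watson-Crick partners: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def partner_strand(strand_5_to_3: str) -> str:
    """Return the base-paired partner strand, also written 5' to 3'."""
    strand = strand_5_to_3.upper()
    if set(strand) - set("ACGT"):
        raise ValueError("strand may contain only A, C, G and T")
    # Complement each base, then reverse so the result reads 5' to 3'.
    return strand.translate(COMPLEMENT)[::-1]


def paired_bases(strand_5_to_3: str) -> list[tuple[str, str]]:
    """List the base pairs, walking the first strand from its 5' end."""
    strand = strand_5_to_3.upper()
    partner = partner_strand(strand)
    last = len(strand) - 1
    # Position i on one strand faces position (last - i) on the partner.
    return [(strand[i], partner[last - i]) for i in range(len(strand))]


top = "ATGGCA"  # hypothetical example sequence
print(partner_strand(top))
print(paired_bases(top))
```

Applying `partner_strand` twice returns the original sequence, which is one way of seeing that each strand carries the information needed to rebuild the other.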


The principles of copying/reading by base pairing and the directionality of the information were critical for understanding the synthesis and roles of RNA. Similarly, the subsequent demonstration by Julius Marmur, Paul Doty and colleagues that complementary strands of DNA (and RNA transcribed from the DNA) could recognize each other by base pairing?32% played an important part in the identification of messenger RNA and in the first analyses of genomic sequence composition and complexity (Chapter 3).

hh Another feature of DNA, often overlooked, is that there are alternative forms of base pairing, notably Hoogsteen pairing, which delayed acceptance of the Watson-Crick model.?5 DNA can also exist in different forms, specifically the A- and B-forms as shown by Franklin (the B-form being the classic double helix), the Z-form discovered by Alex Rich, and others such as G-quadruplexes and I-motifs, which exist naturally in vivo.22%%! It was also later shown that the base stacking and helical dimensions vary according to nucleotide sequence.??


THE BIG QUESTION


Following the double helix, the overarching question was 
how the information in DNA is transduced into proteins. 
It was by then established that, in eukaryotic cells, DNA 
resided in the chromosomes and doubled at cell division. 
These observations were consistent with its role as the 
carrier of genetic information, but not with an involve- 
ment in protein synthesis. 

In the early 1950s, Jean Brachet showed that enucleated cells could temporarily maintain protein synthesis, and based on a series of grafting experiments with the unicellular green alga Acetabularia, Joachim Hämmerling showed the existence of morphogenetic substances produced “under the influence” of the nucleus and transported to the cytoplasm that are “products of gene action, which stand between gene and character”.?

RNA gradually emerged as the intermediate. It was found to be present at high levels in the cytoplasm, particularly in association with the “ergastoplasm” (endoplasmic reticulum) structure?? (Chapter 4), and its levels were found to vary in different tissues and metabolic states. The correlation between the amount of RNA and the rate of protein synthesis, independently observed in the early 1940s by Brachet and Caspersson, led them to propose that RNA was involved in protein synthesis.>°


DISCOVERY OF THE RIBOSOME 


Around the same time, using ultracentrifugation to frac- 
tionate mammalian liver cells infected with the cancer- 
causing Rous sarcoma virus, Albert Claude, who was 
the first to isolate the mitochondrion, the chloroplast, 
the Golgi apparatus and the lysosome, observed cyto- 
plasmic granules associated with membranes, initially 
called microsomes. He found that the granules contained
large quantities of nucleic acids of the “ribose type”,
and Brachet postulated that the granules were the sites
of protein synthesis. Later, microsomes were found to correspond


a Claude was also the first to show, controversially, that the active
agent in the cancer-causing Rous sarcoma virus was not
“thymonucleic acid” but “strongly positive ... for pentoses” (i.e., RNA),
6 years before Avery. Rous sarcoma virus became famous later for its
role in the discovery of the first ‘oncogene’ (Chapter 6).
to microvesicles, arising from fragments of membranes 
from the endoplasmic reticulum, with which ribosomes 
are commonly associated. The granules were visualized 
in 1955 using electron microscopy by George Palade and 
Philip Siekevitz,!! and re-named ‘ribosomes’ in 1958 by 
Richard Roberts,!? in view of their abundant ribonucleic 
acid component (Figure 3.1). 

Ribosomes were initially characterized by their
physical sedimentation rates, with bacterial ribosomes
designated as ‘70S’ and eukaryotic ribosomes as ‘80S’,
terminology that is still in use today. Subsequently it
was found that ribosomes could be separated into two
main components, a large ‘50S’ subunit and a small
‘30S’ subunit in bacteria, and equivalently a large ‘60S’
subunit and a small ‘40S’ subunit in eukaryotes. These
studies were aided by the use of detergent solubilization,
chaotropic agents and phenol extraction techniques
to isolate intact RNAs, prior to which most preparations
contained mainly degradation products due to the
ubiquity of RNases released in lysed cells and present
on skin.

Small ribosomal subunits were found to be composed
of a number of proteins complexed with an RNA termed
16S and 18S rRNA in bacteria and eukaryotes,
respectively. The large subunit in bacteria contains two RNAs
(23S and 5S) whereas the large subunit in eukaryotes
contains three RNAs (28S, 5.8S and 5S), all complexed
with proteins. The ribosomal RNAs are transcribed as
a single large precursor, which is processed to form the
individual rRNAs, from one operon in bacteria and many
in eukaryotes, the latter shown later to be tissue- and
developmental stage-specific, as are many ribosomal
proteins.

It was also shown by radioactive tracing that ribosomes
are the sites (the cellular factories) where amino
acids are assembled into proteins, although the
mechanism was yet to be defined.

Thus, the original association of RNA with protein 
synthesis was a result of the detection of the most abun- 
dant RNAs (i.e., rRNAs) allowed by the techniques of 


b As both subunits contain many proteins, RNA was simply thought 
to be the framework for the machine, until it was shown much later 
that RNA in fact lies at the catalytic heart of peptide bond formation 
(Chapter 9). 
FIGURE 3.1 Electron micrograph of ‘ribosomes’ associated with the endoplasmic reticulum and free in the cytoplasm of liver
cells. (Reproduced from Palade and Siekevitz with permission from Rockefeller Institute Press.)


the time. This abundance obscured the far more complex 
population of other RNAs that are expressed from the 
genome, as later the relatively high abundance of mes- 
senger RNAs also obscured the presence of equally if 
not more complex populations of cell-specific regulatory 
RNAs in plants and animals (Chapters 12 and 13). 


THE MESSENGER AND THE ADAPTOR 


The discovery of messenger RNA (mRNA), and its
templating of protein synthesis in ribosomes by interaction
with adaptor molecules, was science at its best,
involving the interplay of observation, logical deductions,
discussions and ingenious experiments by many
individuals, notably François Jacob, Jacques Monod, Crick,
Watson, Sydney Brenner, Marshall Nirenberg and their
collaborators.

The dominant hypothesis that emanated from the 
1940s to explain protein biosynthesis was summarized 
in 1950 by Peter Caldwell and Cyril Hinshelwood: “In 
the synthesis of protein, the nucleic acid, by a process 


* Both Jacob and Monod were eclectic individuals with interesting
histories, including participation in the French Resistance in World War
II. Between them, in addition to the 1965 Nobel Prize in Physiology
or Medicine, awards included France's highest World War II
decoration for valour, the Cross of Liberation, as well as the Croix de
Guerre, the Légion d'Honneur and the American Bronze Star Medal.
Monod also shared a deep postwar friendship with the
writer-philosopher Albert Camus.


analogous to crystallization, guides the order by which 
the various amino acids are laid down.”2 

The importance of protein sequence was reinforced by
the discovery in 1949 by Linus Pauling's team that the
electrophoretic mobility, and therefore the amino acid
composition, of hemoglobin is altered in sickle cell anemia,
showing for the first time the molecular basis of a genetic
disease, with the specific amino acid changes in this and
other mutant hemoglobins subsequently identified.

It was clear by the early 1950s that the nucleus is
the source of RNA and that RNA could serve as the
template for protein assembly, first proposed by André
Boivin and Roger Vendrely in 1947 and elaborated by
Alexander Dounce and Brachet in the early 1950s
(Figure 3.2). It was also becoming evident that protein
synthesis requires “the ordered interaction of three
classes of RNA – ribosomal, soluble, and messenger”
(see below).

In 1958, as a key part of the theoretical considerations 
of the process of protein encoding by DNA (the “coding 
problem”), Crick proposed the ‘Adaptor Hypothesis’, by 
which some molecule must serve as the carrier for amino 
acid incorporation into peptide chains during protein 
synthesis. Crick postulated that the adaptor was RNA, 
given that “base pairing made RNA uniquely suited for 
a role as a small, specific RNA recognition molecule”.% 


d André Boivin was one of the earliest and most visionary supporters
of Avery’s claim that DNA was the hereditary material.
Fig. 41. Scheme proposing relationship between DNA, RNA and proteins in 
the different parts of the cell. chr: chromosomes; cy: cytoplasm; n: nucleus; 
no: nucleolus. 
FIGURE 3.2 Jean Brachet’s speculations on the flow of genetic information. Figure 41 in his 1959 paper on the biological role of 
ribonucleic acids. (Reproduced from Brachet, with permission from Elsevier under Creative Commons 4.0 license.) 


Such RNAs had just been identified by Mahlon
Hoagland, Paul Zamecnik and colleagues, who showed
that small soluble ‘sRNAs’ could be conjugated to amino
acids (labeled with a radioactive carbon isotope, 14C) and
transfer the labeled amino acids to proteins in
microsomal preparations. This reaction required GTP
(guanosine triphosphate), later shown to be the energy source for
peptide bond formation. From this they concluded that
such RNAs, later named transfer RNAs (tRNAs),
function as the intermediate carrier of amino acids in protein
synthesis.

The factory and the adaptor had been found, but the 
template and the ‘code’ remained undefined. With the 
association of ribosomes with protein synthesis increas- 
ingly accepted, one hypothesis was that distinct ribo- 
somes served as templates for different proteins, leading 
to the new aphorism “one gene — one ribosome — one 
enzyme”.*° 


However, given the rapid rates of protein synthesis 
that were observed, for example, after infection of bacte- 
ria with bacteriophages (phages*) and that the two known 
RNA species (rRNAs and tRNAs) were essentially 
homogeneous, stable and similar in different species, this 
hypothesis seemed implausible: these RNA species did 
not fulfill the requirements of dynamic templates for pro- 
tein synthesis.2258.4-47 

Early clues for the intermediate candidate had been 
obtained with the use of nucleotides labeled with a 


* The term given to bacterial viruses, which were discovered by
Frederick Twort and Félix d'Hérelle in 1915-1917. It was coined by
the latter from the Greek for ‘bacteria eater’ and is often shortened
to ‘phages’. The use of bacteriophages was instrumental in the
analysis and elucidation of gene structure, replication and expression,
as they comprised an extremely powerful system that could introduce
genetic changes and poll millions of genetic events in overnight
bacterial culture.
radioactive isotope of phosphorus (32P). In 1953, Hershey
and colleagues found that, unlike DNA, a small fraction
of RNA is synthesized “extremely rapidly” following T2
phage infection. In 1956, Elliot Volkin and Lazarus
Astrachan reported that while T2 phage infection
arrested bacterial protein synthesis, it triggered massive
phage protein synthesis. They also noticed that while
cellular RNA remained essentially unchanged, short-lived
RNA with the same base composition as the viral DNA
was contemporaneously produced. Volkin and Astrachan
called this variant RNA “DNA-like-RNA” and remarked
that “such RNA molecules may be an entire new
species, possibly related to phage growth”. In 1959, Arthur
Pardee, Jacob and Monod showed the same rapid
inducibility of short-lived RNA from the lac operon following
lactose exposure (the ‘Pajama’ experiment).

Others, notably Sol Spiegelman, Benjamin Hall and
Masayasu Nomura, confirmed and extended these
observations, which, although not widely recognized at the
time, were crucial for the discovery of the messenger.
Consequently, Jacob and Monod, in their famous 1961
paper describing the operon model (see below),
postulated the existence of an “unstable” RNA that conveyed
the genetic information for protein production to the
cytosol. The “candidate” (which they first called “X”)
was named “messenger RNA” (mRNA).

Brenner, Jacob and Meselson were already working
on this hypothesis and in the same year proved the
existence of mRNA using the phage system and incorporation
of labeled RNA into previously existing ribosomes.
At the same time, François Gros, Walter Gilbert and
colleagues in Watson's laboratory demonstrated rapid
turnover of mRNAs and their “DNA-like” base composition
in bacteria, and three groups demonstrated
DNA-dependent RNA synthesis in bacteria and in isolated
nuclei of mammalian cells.56-58


THE “GENETIC CODE” 


In parallel, attention was turning to the nature of the
information that instructed the sequence of amino acids
in proteins. Theoretical consideration of a ‘genetic code’
was a major theme in the post-World War II period,
especially in the so-called ‘RNA Tie Club’, which included
physicists such as George Gamow and Richard Feynman,
as well as Crick, Brenner, Martynas Yčas and others,


f The Pajama experiment also revealed that the induction of beta- 
galactosidase from the lac operon is regulated by a repressor,?? 
which ushered in the concept of the regulation of gene expression.?! 
founded in 1954 to “solve the riddle of the RNA structure
and to understand how it built proteins”.

An intellectual world away from biochemistry but
emergent in the background was the revolutionary work
in the 1940s on information theory and computational
control systems connecting machines with biology by
Claude Shannon and Norbert Wiener, which ushered
in the digital age and, inter alia, the concepts of genetic
coding and genetic programming.

In 1951, Fred Sanger and colleagues established that
proteins are linear polymers of amino acids, and
produced the first protein sequence, that of bovine
insulin, by partial hydrolysis of the two peptide chains,
an approach whose principles he later applied to RNA
sequencing (Chapter 6).

Of central importance was Seymour Benzer's use of
genetic recombination and temperature-sensitive mutants
of bacteriophage T4 at the turn of the decade to map the
fine structure of genes. Benzer, who later went on to
become a pioneer of behavioral genetics, showed that
genes are linear but not indivisible, using the resolving
power of his system to identify deletions and nucleotide
changes, some of which specify a different amino acid
and others that corrupt or terminate protein synthesis.

Reasonably, then, genes and proteins were presumed to
be co-linear, that is, the order of nucleotides is the same
as that of their specified amino acids, but it was unknown
whether the code is overlapping or non-overlapping.i
The former was considered unlikely on logical grounds
by Brenner and was ruled out experimentally by Akira
Tsugita and Heinz Fraenkel-Conrat, who showed in 1960
that a point mutation resulted in just one amino acid change.

The matter came to a climax in 1961. Crick and others 
reasoned that the length of the coding units (‘codons’) 
must be at least three to be able to specify all 20 amino 
acids that standardly occur in proteins, which in turn 
implied that, if so, there may be more than one ('redun- 
dant’) codon for each amino acid, or at least some of 


g Shannon’s PhD thesis was entitled ‘An Algebra for Theoretical
Genetics’.

h Benzer was later described as the researcher who “more than any
other single individual, enabled geneticists to adapt to the molecular
age”.

i This also led to the common one-dimensional conception of RNAs,
which have complex three-dimensional structures that also transmit
information (Chapters 8 and 16). However, only tRNAs were
explicitly considered to have a “protein-like structure”.

j In 1966, following the first determination of the sequence and
secondary structure of a tRNA (see below), Crick published ‘The
Wobble Hypothesis’, which provided a structural explanation for the
degeneracy of the genetic code.
TABLE 3. NUCLEOTIDE SEQUENCES OF RNA CODONS

FIGURE 3.3 The genetic code for amino acids presented by Nirenberg et al. (Reproduced with permission of Cold Spring
Harbor Laboratory Press.)


them (4² = 16; 4³ = 64), with corresponding ‘adaptor RNAs’
(tRNAs) linked to cognate amino acids.
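
The counting argument can be checked with a few lines of arithmetic. The sketch below is purely illustrative, and the names in it (BASES, AMINO_ACIDS, codons_available, min_codon_length) are introduced here rather than taken from the text. It counts how many distinct 'words' of a given length a four-letter alphabet provides and finds the shortest length that can cover 20 amino acids.

```python
# Illustrative sketch: how long must a codon be to specify 20 amino acids?
BASES = 4          # A, C, G and U
AMINO_ACIDS = 20   # amino acids that standardly occur in proteins


def codons_available(length: int) -> int:
    """Number of distinct nucleotide 'words' of the given length."""
    return BASES ** length


def min_codon_length(targets: int = AMINO_ACIDS) -> int:
    """Smallest word length whose codon count is at least `targets`."""
    length = 1
    while codons_available(length) < targets:
        length += 1
    return length


for n in (1, 2, 3):
    print(n, codons_available(n))          # 4, 16 and 64 codons respectively

print(min_codon_length())                   # 3
print(codons_available(3) - AMINO_ACIDS)    # 44 spare codons, hence redundancy
```

Doublets fall short (16 < 20), whereas triplets overshoot (64 > 20), which is why a triplet code implies that at least some amino acids have more than one codon.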

In 1961, Crick, Brenner and colleagues used Benzer’s 
high-resolution bacteriophage gene system and some of 
his mutants to show that insertion of one or two nucleo- 
tides in the coding sequence resulted in a non-functional 
protein, because it threw the subsequent codons out of 
kilter (‘frame-shift’ mutations), whereas the insertion 
or deletion of three nucleotides had more subtle effects, 
thereby demonstrating that the coding unit was indeed a
triplet.
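
The logic of the frame-shift experiment can be illustrated with a toy translation. This is a minimal sketch, not a reconstruction of the original experiments: the message `wild_type` is hypothetical, the helper names are introduced here, and the codon table is the standard genetic code, which was not yet known in 1961.

```python
# Illustrative sketch of a frame-shift, using the standard genetic code.
from itertools import product

# Standard code in the conventional U, C, A, G ordering; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Read non-overlapping triplets from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def insert(mrna: str, position: int, bases: str) -> str:
    """Return a copy of the message with `bases` inserted at `position`."""
    return mrna[:position] + bases + mrna[position:]


wild_type = "AUGGCUGAAUUUCGUUAA"                 # hypothetical message
print(translate(wild_type))                      # MAEFR
print(translate(insert(wild_type, 3, "C")))      # MR
print(translate(insert(wild_type, 3, "CCC")))    # MPAEFR
```

A single inserted base changes every downstream codon (here it creates a premature stop), whereas inserting three bases adds one amino acid and leaves the rest of the protein intact.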

In the same year, experiments with RNA homopolymers
in cell-free extracts by Marshall Nirenberg and
Heinrich Matthaei demonstrated that polyuridine can
direct the incorporation of the amino acid phenylalanine
into proteins. This not only proved that messenger RNA
directs protein synthesis, but also provided the platform
for working out the entire triplet-based genetic code by
the mid-1960s using combinations of nucleotides in
synthetic RNAs (Figure 3.3).
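
One reason homopolymers were such a clean starting point is that they read the same in every frame. The short sketch below illustrates this under stated assumptions: the function name and the small lookup table are introduced here, and the assignments for AAA and CCC come from the standard genetic code rather than from the text above.

```python
# Illustrative sketch: a homopolymer contains a single codon in every reading frame.
HOMOPOLYMER_CODONS = {
    "UUU": "phenylalanine",  # the polyuridine result described above
    "AAA": "lysine",         # standard-code assignment, listed for comparison
    "CCC": "proline",        # standard-code assignment, listed for comparison
}


def codons_in_all_frames(rna: str) -> set:
    """Collect every triplet that occurs at any offset in the sequence."""
    return {rna[i:i + 3] for i in range(len(rna) - 2)}


poly_u = "U" * 30
codons = codons_in_all_frames(poly_u)
print(codons)                                         # {'UUU'}
print([HOMOPOLYMER_CODONS[c] for c in codons])        # ['phenylalanine']
```

Because only one triplet is present whatever the frame, the amino acid incorporated identifies that codon without needing to know where translation starts.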

As André Lwoff put it, “the messenger ceased to be an
être de raison and became a molecule”, and the
aphorism “one gene–one enzyme” had found the intermediate.


THE lac OPERON AND GENE REGULATION 


The year 1961 also saw the publication of Jacob and
Monod’s classic model of gene regulation in the same
paper that proposed the existence of mRNAs, based on
studies of lambda phage infection and the genetic
dissection of the lac ‘operon’ of E. coli. This had a decisive
impact on the conceptual framework of the regulation
of gene expression and the protein-centric paradigm of
genetic information that has dominated molecular
biology for most of its history. Every undergraduate student
of molecular biology is taught the lac operon as the
exemplar of gene regulation.

The lac operon consists of three ‘structural’ genes
(transcribed as one ‘polycistronic’ mRNA containing
three open reading frames) that specify three proteins
involved in the uptake and metabolic utilization of the
milk sugar lactose by the bacteria in the gut, including
the enzyme beta-galactosidase, together with a nearby


* Beta-galactosidase became a favorite target of assays for gene 
expression by linking its coding sequences to presumed regulatory 
elements and assaying its activity using an artificial (‘chromogenic’) 
substrate that produced a blue color in response to enzyme activity. 
FIGURE 3.4 Jacob and Monod's 1961 models of the lac operon and the regulation of protein synthesis. (Reproduced with
permission from Elsevier.) Note that in both models the lac repressor is drawn as an RNA.


‘repressor’ gene, whose product keeps the lac genes 
silent until and unless lactose is present — there is no 
point in producing the enzymes to utilize lactose if none 
is present. 

Jacob and Monod articulated the notion that genomes
contained both “structural genes”, which encoded
enzymes and other proteins such as hemoglobin and
insulin, and “regulator genes”, which specified
regulatory systems that control the expression of the
former. In this model, structural genes obeyed the
‘one-gene, one-protein’ principle, and regulator genes encoded
a trans-acting “repressor” (of unknown composition) that
would interact with other DNA sequences (“operators”)
linked in cis (that is, adjacent to the promoter) to block
the initiation of transcription, which in turn implied that

l And perhaps other regulatory genes, in a chicken-and-egg hierarchy, 
especially during the complex suites of gene expression during mul- 
ticellular differentiation and development — see Chapter 15. 


they would generally lie upstream of the target genes. 
There was no consideration of the converse, that there 
may also be activators that operate similarly. It was dis- 
covered later that the product of another gene with a more 
universal activator function makes it easier for special- 
ized sugar utilization genes to be induced if energy levels 
are low (Figure 3.4). 

In any case, the concept that differential gene activ- 
ity underlies cell differentiation was obvious and had 
already been proposed in the 1950s, but whether 
the regulation of gene expression could be explained 
simply in terms of the action of regulatory proteins 
encoded by other genes was uncertain, although begin- 
ning to be widely assumed. The cytogeneticist Barbara 
McClintock intuited that there was a distinction between 


m What was initially called the ‘operator’ is now referred to as the gene 
“promoter”, which encompasses the regulatory sequences upstream 
of the transcribed regions of genes recognized by regulatory factors 
and RNA polymerase. Operator may be the better term. 
(protein-coding) “gene elements” and (mobile)
“controlling elements”, based on genetic studies of transposon
mobilization in maize (Chapter 5). She proposed that
these mobilized elements, despite being neither part of the
“gene” nor (likely) enzymes, would act as modifiers,
suppressors or inhibitors of gene activity, and predicted their
general occurrence in other organisms.

Accordingly, McClintock warned against settling
too quickly on a protein-centric definition of genes (and
mutations) based on studies in bacteria before the
structure of DNA or the nature of the controlling factors she
had found, or any ‘gene’, were defined. In 1950, she
wrote in a letter to a colleague:


Are we letting a [protein-coding] philosophy of 
the gene, control [our] reasoning? What then is 
the philosophy of the gene? Is it a valid philoso- 
phy? ... When one starts to question the reason- 
ing behind the present notion of the gene (held by 
most geneticists) the opportunity for questioning 
its validity becomes apparent. 


Moreover, early evidence had indicated that RNAs might
have other properties, beyond their roles in protein
translation, of potential importance in genetic transactions.
Soon after the publication of the double-helical
structure of DNA, Alexander Rich (a founding member of
the RNA Tie Club) and David Davies showed that RNA
molecules could base pair to form double-stranded RNAs
(dsRNAs), a discovery that was met with some
skepticism or disregard, although later shown to be a feature
of cellular RNA interactions and regulatory systems
(Chapters 8 and 12). In 1957, Rich, Davies and Gary
Felsenfeld showed that RNA can also interact
sequence-specifically with double-stranded RNA or DNA to form
three-stranded (triplex) structures, via non-canonical
‘Hoogsteen’ base pairing in the major groove of the
helix, which “may have significance as a prototype for
a biologically important three-stranded complex, as, for
example, a single ribonucleic acid chain wrapped around
a two-stranded DNA”. In 1961, Spiegelman and Hall
showed that RNA-DNA hybrids exist naturally in cells.

A few years later, Robert Holley, Ada Zamir and 
colleagues showed by the first sequencing of a tRNA 
(alanine tRNA, using partial ribonuclease digestion and 


n McClintock told Charles Burnham in January 1950: “Even though 
the details are manifold, obviously, there is a consistency that does 
not fail. You can see why I have not dared publish an account of this 
story. There is so much that is completely new and the implications 
are so suggestive of an altered concept of gene mutation that I have 
not wanted to make any statements until the evidence was conclusive 
enough to make me confident of the validity of the concepts.” 

e Hoogsteen base pairing occurs in tRNAs.!% 
two-dimensional fractionation) that RNAs form
secondary structures via internal base pairing, forming a
“cloverleaf” structure with double-helical base-paired
regions when displayed in two dimensions. These
analyses, which took 9 years, also identified ten chemical
modifications of its nucleotides (Chapter 17).

The tRNA structure was confirmed and its canonical
L-shape revealed almost a decade later by Rich and
colleagues using X-ray crystallography, the first
determination of the 3D structure of a natural RNA. Later
studies showed that all tRNAs have four hairpin
helices and three variable loop structures inserted between
two hairpin structural elements and that the 3’ end of all
tRNA molecules contains a conserved CCA sequence,
to which the relevant amino acid is attached by specific
enzymes (Figure 3.5).

These and subsequent structural studies revealed
unusual structural motifs, non-canonical base pairing,
tertiary interactions, intercalated strands,
pseudoknots, coaxial stacking and bound metals, pointing
to the structural complexity and versatility of RNA
molecules.108,109,113-116 Although the significance of these
properties of RNAs was cryptic, the possible regulatory
implications of DNA-RNA and RNA-RNA interactions
did not go entirely unnoticed.

In 1959, Arthur Pardee and Louise Prestidge, and a
year later Leo Szilard, suggested that RNA would make
a good candidate for the lac repressor. Jacob and
Monod subsequently, in their magnum opus on the lac
operon, also proposed that RNA may be the agent
produced by the regulatory gene, emphasizing its sequence
specificity. In their words: “the operator tends to
combine (by virtue of possessing a particular base sequence)
specifically and reversibly with a certain (RNA)
fraction possessing the proper (complementary) sequence.”
Because RNA can base pair with DNA and RNA, they
proposed two models by which RNA could act as a
repressor, either at the transcriptional level (“genetic
operator model”) or at the post-transcriptional level
(“cytoplasmic operator model”, where the operator is present
in the “polycistronic” transcript). Jacob and Monod favored
the former. This was a special moment for RNA in
the history of molecular biology, with great conceptual
implications, but was short-lived, for reasons explained
below.

Rich proposed in 1961 that both strands of DNA 
in the cells could potentially template complementary 
(‘antisense’) copies of RNAs, which was shown much 


p Fred Sanger and colleagues developed a similar method to sequence
RNAs at the same time.
by Holley et al. (Reproduced with permission of the American Association for the Advancement of Science.) (b) The general
secondary and (c) spatial L-shaped structure of tRNA. (Reproduced from Suzuki with permission from Springer Nature.)


later to be a widespread occurrence, especially in
animal and plant cells (Chapter 13). He speculated that “it
does not seem likely that both of these [DNA strands]
go on to manufacture a protein molecule” and that there
is “an interesting possibility in that this may be part of
the control apparatus for turning on or off the synthesis
of a given class of proteins”, suggesting that this could
involve the formation of double-stranded RNA, which
also turned out (much later) to be correct (Chapter 12).
Again using similar principles, Kenneth Paigen
elaborated on the operon model in 1962, and speculated that
diffusible (trans-acting) RNAs produced by regulator
genes could base pair with the non-template DNA strand
of structural genes, reversibly regulating the “release”
of messenger RNAs produced from the template DNA
strand.

Paul Sypherd and Norman Strauss offered the
possibility that the repressor system could involve
complexes with both RNA and protein, wherein proteins
had specificity for small molecule ligands (in this
case, lactose) and the RNA provided specificity for the
operator. The latter (RNA guidance of transcription
factors and chromatin-modifying proteins to specific
genomic locations) was a prescient prediction (Chapter
16). RNA was also later shown to be capable, like
proteins, of binding small molecules and responding
allosterically (‘riboswitches’) to regulate gene expression
(Chapter 9).


q Even ribosomal RNA was later shown to regulate the expression of
genes that control development.


Although these models were proposed in the absence
of knowledge of many of the enzymatic components
(such as helicases) and mechanisms involved in DNA
replication and transcription, the recurrent theme
was that they invoked the “simplicity” and “logic” of
RNA regulation via base pairing, which only required
an RNA size of 10-12 bases to “provide the necessary
specificity”, foreshadowing the action of microRNAs
and other small RNAs discovered at the turn of the next
century (Chapter 12).
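
A rough calculation shows why 10-12 bases is a plausible threshold for specificity in a bacterial genome. The sketch below is a back-of-envelope illustration only: it assumes a random sequence with equal base frequencies, uses an approximate E. coli genome size of 4.6 million base pairs as an assumed figure, and the function names are introduced here.

```python
# Illustrative sketch: how specific is a short base-pairing RNA?
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Sequence that would base pair (antiparallel) with the given RNA."""
    return "".join(PAIR[base] for base in reversed(rna))


def expected_chance_matches(length: int, target_size: int) -> float:
    """Expected number of exact matches to one given sequence of `length`
    bases in a random target of `target_size` positions (equal base
    frequencies assumed)."""
    return target_size / 4 ** length


GENOME = 4_600_000  # approximate E. coli genome size in base pairs (assumption)

for n in (8, 10, 12):
    print(n, 4 ** n, round(expected_chance_matches(n, GENOME), 2))

print(reverse_complement("AUGC"))  # GCAU
```

Under these assumptions an 8-base sequence would be expected to recur by chance dozens of times in such a genome, a 10-base sequence a handful of times, and a 12-base sequence less than once, which is consistent with the range quoted above.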

Nevertheless, in the years following the lac operon model,
the proposition of regulatory RNAs was disfavored,
because the emerging models required that the repressor
interact with small molecules (metabolic effectors), for
which proteins with three-dimensional structures (such
as allosteric enzymes, which alter their shape and
activity upon binding small molecules) seemed more
suitable, even though these views were “more doctrinal than
empirical” and “the proteinaceous nature of the repressor
was taken for granted”.


" The important concept of allostery (‘allosteric inhibition’) was 
advanced by Monod and Jacob in 1961 to describe binding of a
ligand to one site in a protein causing a structural change that
hampers the binding of a second ligand (a DNA sequence or an enzyme
substrate) at another site. Monod and colleagues correctly
predicted that allostery might be a general form of cellular regulation,
and that allosteric sites might be useful drug targets. RNAs
can also act as allosteric ‘riboswitches’ that respond to small
molecules and other cues, especially in bacteria, although these were
not discovered until 2002 (Chapter 9).


Halcyon Days 


This expectation was confirmed and its generality
assumed when Walter Gilbert and Benno Müller-Hill
found in 1966 that the lac repressor is a protein and
Mark Ptashne subsequently isolated the bacteriophage
lambda repressor protein, both shown to specifically
bind to regulatory (“operator”) DNA sequences upstream
of the protein-coding genes. Moreover, in 1965,
Ellis Englesberg and collaborators had demonstrated the
existence of protein ‘activators’ in the control of gene
expression in bacteria, expanding the dominant
‘repressor’ model (negative control), although, oddly, Monod
was unconvinced. Then, with the discovery of the ‘sigma
factors’ controlling RNA polymerase transcription
initiation, the basic mechanism of regulating gene
expression in bacteria, and presumptively in higher organisms,
seemed generally understood, despite RNA players
emerging in the background (Chapter 9).

These findings consolidated the conclusion that pro- 
teins comprise not only the ‘enzymes’ but also the ‘regu- 
lators’ of gene expression, and the suggestions by Jacob 
and Monod, Rich and others that some genes might spec- 
ify regulatory RNAs were relegated to history. 


PROTEIN STRUCTURE 


As mentioned above, an important development during
this period was protein sequencing, first achieved in the
late 1940s and early 1950s by Fred Sanger using partial
acid hydrolysis to sequence insulin. This approach was
soon supplanted by the cyclic cleavage of terminal amino
acids, called Edman degradation after Pehr Edman, who
developed it in 1950, which was later automated.


* During this period, Englesberg gave several seminars at the
Pasteur Institute. As told by Jacob: “After each seminar, however,
[Englesberg] received a severe lesson in regulatory genetics from
Monod, who always insisted on a notion ‘that even a schoolboy
cannot ignore: negative × negative equals positive’.” Englesberg said that
“whenever I spoke with Jacob and Monod, they would say that they
were 33.3% convinced, and then 50% convinced, about positive
control. When I gave a seminar at the Pasteur ... in 1972, they said ‘Well,
we are 66.6% convinced’.”

t The identification of the lac repressor protein occurred in the same
year as the first RNAs not associated with translation were detected
in human cells, and just 1 year before the first abundant non-tRNA
small RNA was discovered in E. coli. The latter was a ubiquitous
small (180-200 nt) regulatory RNA, 6S RNA, whose function as a
repressor of sigma factor-dependent gene transcription and regulator of
RNA polymerase promoter use was not determined until 30 years
later. A second small (109 nt) RNA found in E. coli in 1973
(a transcript named Spot 42, encoded by the spf gene) also had an
unknown function until 2002: it is also a trans-acting antisense
RNA, which represses the galactose operon (and indeed many other
operons) at the post-transcriptional level by base pairing with the
galK mRNA (Chapter 9).
The determination of the amino acid sequence of
proteins (later to be much more efficiently and accurately
deduced from gene sequences) was an essential
prerequisite to the determination of their three-dimensional
structures using X-ray crystallography. This technique was
developed and applied by John Kendrew and Max Perutz and
colleagues in 1958 to hemoglobin and muscle myoglobin, which
showed that these proteins folded into three-dimensional
globular structures (Figure 3.6). These studies also
revealed the major structural elements of proteins,
initially encompassing α-helices, β-sheets, turns and later
transmembrane domains and enigmatic ‘intrinsically
disordered regions’ (Chapter 16).

The structural analysis of proteins was initially 
restricted by their ability to form crystals — creating these 
was and is an art in itself. Later, nuclear magnetic resonance (NMR)
spectroscopy, first described by Isidor Rabi in 1938 and
developed in 1946 by Felix Bloch and Edward Purcell,
allowed the structures of relatively small proteins to be determined
in solution by Kurt Wüthrich, Richard Ernst, Ad Bax,
Marius Clore, Angela Gronenborn and Gerhard Wagner, 
among others, in the 1970s and 1980s,!%3154 and has con- 
tinued to be refined. More recently, the development 
of improved methods of cryo-electron microscopy by 
Jacques Dubochet, Joachim Frank, Richard Henderson? 
and others has allowed structural characterization of 
much larger proteins and protein complexes.!56-15? 

Atomic resolution of protein structure fueled enduring 
discovery, accelerated by high-throughput methods and 
many technical innovations that revealed the structure- 
function relationships, fine chemistry and dynamics in 
the enzymes, molecular machines and macromolecular 
components of cells. 


THE CENTRAL DOGMA 


In the late 1950s, before the identification of mRNA, Crick
publicly articulated what he termed the “Central Dogma”
of the directional flow of genetic information, 9.60
reflecting earlier considerations by Boivin and Vendrely,
Brachet, Watsonᵛ and Dounce. In this paradigm, proteins
were the final destination of the information contained in
DNA and conveyed by RNA, as once "information has
passed into protein it cannot get out again". ^?


u The entry of physicists into biology in the 1940s and 1950s
revolutionized macromolecular analysis. There is a deeper history, with
the development of the principles of X-ray diffraction by crystals and
the mathematics involved to probe their atomic structure, dating back to
the early 1900s, pioneered by Max von Laue and the father and son
team of William Henry and William Lawrence Bragg.^9!? X-ray
fiber diffraction data was, of course, also central to the elucidation of
the double-helical structure of DNA.

v Watson sketched the Central Dogma in his lab notebook in 1952.161,162


FIGURE 3.6 The first three-dimensional model of myoglobin obtained by
X-ray analysis. (Reproduced from Kendrew et al.!48 with permission of
Springer Nature.)

In a subsequent formalization in 1970, Crick also
indicated, by way of a dashed line, that RNA may
itself be copied, along with a dashed line indicating that
information could flow in reverse from RNA to DNA,ʷ
but not from protein to RNA (Figure 3.7). These
modifications were presumably made in the wake of the
finding in the 1960s that RNA viruses replicate!'**1! and the
discovery of virally encoded reverse transcriptaseˣ
independently, and with dogged determination in the face of
skepticism, by David Baltimore and Howard Temin in
1970.75-17 Reverse transcriptase was critical to enable
the coming gene cloning revolution (Chapter 6) and the
understanding of retroviral biology, the full implications
of which have yet to be realized (Chapter 10).


w Crick's original diagram of the flow of genetic information included
a dotted line from RNA back to DNA.*°

x RNA-dependent RNA polymerases are also found in eukaryotic
cells."? There are a number of ‘DNA repair’ enzymes with reverse
transcriptase activity in the brain, with obvious implications."
Information also moves laterally from DNA via RNA to DNA by
retrotransposition,"^ the outcomes of which dominate the genome
and genetics of complex organisms (Chapters 4, 10 and 16).

The Central Dogma has held true to this day (except 
for the speculative transfer of information directly 
from DNA to protein), but became widely interpreted, 
including by Watson,!% as ‘DNA makes RNA makes
proteins’, with its implicit assumption, not necessarily
intended by Crick, that RNA functions only as an
intermediate.


IT'S ALL OVER NOW 


The two decades from 1953 to 1972 were exhilarating 
and the new crop of molecular biologists were rightly 
pleased with what had been achieved, but their self-satisfaction
and hubris were palpable. The lac operon and
the Central Dogma consolidated the notion that (with
exceptions like the few rRNA and tRNA types) genes are 
synonymous with proteins, and that all genetic informa- 
tion, including regulatory information, is transacted by 
proteins, not only in bacteria but also in developmentally 
complex plants and animals. 


Halcyon Days 
[Legend reproduced within the figure: "Fig. 3. A tentative
classification for the present day. Solid arrows show general
transfers; dotted arrows show special transfers. Again, the absent
arrows are the undetected transfers specified by the central dogma."]


FIGURE 3.7 Crick’s 1970 formulation of the “Central Dogma”. (Reprinted by permission from Springer Nature.)


Consequently, the hegemony of proteins as both struc- 
tural and regulatory molecules was established, prema- 
turely, within the first two decades of molecular biology, 
despite the odd molecular and genetic observations in 
plants and animals (Chapters 4 and 5) and a looming sur- 
prise that should have given pause for thought (Chapter 7), 
with more to come (Chapters 8-13). 

As Crick opined in 1958: "Biologists should not 
deceive themselves with the thought that some new class 
of biological molecules, of comparable importance to the 
proteins, remains to be discovered". ^? 

And Brenner in 1963: 


It is now widely realized that nearly all the 'clas- 
sical' problems of molecular biology have either 
been solved or will be solved in the next decade. ... 
Molecular biology succeeded in its analysis of 
genetic mechanisms partly because geneticists 
had generated the idea of one gene-one enzyme. ...
Molecular biology succeeded also because
there were simple model systems such as phages 
which exhibited all the essential features of higher 
organisms so far as replication and expression of 
the genetic material were concerned ... It is prob- 
ably true to say that no major discovery compara- 
ble in importance to that of, say, messenger RNA, 
now lies ahead in this field. 


y Sydney Brenner, excerpts from letter to Max Perutz, June 1963;
reproduced in Wood (1988).!78

z Sydney Brenner, excerpts from proposal to the Medical Research
Council, October 1963; reproduced in Wood (1988).!78


Gunther Stent proclaimed in his 1968 article entitled 
‘That Was the Molecular Biology That Was’: 


All hope that paradoxes would still turn up in the 
study of heredity had been abandoned long ago, 
and what remained now was the need to iron out 
the details ... [and that there remained] ... only 
one major frontier of biological enquiry for which 
reasonable molecular mechanisms cannot be 
envisaged: the higher nervous system.!” 


And Brenner added [Stent’s point is] 


that once we knew both the structure of DNA and 
that nucleotide sequences encoded amino acid 
sequences of proteins, and that once the princi- 
ple of gene regulation had been found by Jacob 
and Monod, there was nothing left to do. Thus 
embryology could be accounted for by simply 
turning on the right genes in the right place at the 
right time and that was the solution to the prob- 
lems of development. Not only did we not have 
to bother investigating the developmental biology 
of the millions of different species of animals and 
plants, but there would be no motivation for sci- 
entists to pursue those fields because the mystery 
had vanished.!8° 


The belief that genes are synonymous with proteins 
reflected the mechanical zeitgeist of the age. Bicycles 
and cars have parts, and so do organisms — proteins that 
carry oxygen (hemoglobin), form skin (keratins), signal 
energy levels (insulin) or control the activity of other 
genes (‘transcription factors’), etc. It was just assumed
that these ‘conserved’ components, whose expression is 
regulated by trans-acting transcription factors acting on 
malleable adjacent promoter-operator sequences, were 
enough to explain all of biology. 

Little thought was given at the time to the enormous 
differences between bacteria and developmentally com- 
plex organisms. The ‘biochemical unity of life’ was just 
taken as given.'*! As Monod said in 1954, in a recapitula- 
tion of a 1926 assertion by the microbiologist Albert Jan 
Kluyver:** “Anything found to be true of E. coli must also 
be true of elephants."!*! 

That may be the case, but the logical trap was that the 
reciprocal might not be. No one knew. 
4 Worlds Apart


It was already evident that multicellular eukaryotes 
are orders of magnitude more complex than bacteria. 
Humans, for example, have ~30-40 trillion cells!? that
are precisely assembled during embryonic and post- 
natal development into a myriad of different and pre- 
cisely sculpted muscles, bone and other organs, and a 
brain with over 85 billion neurons and a trillion synaptic 
connections.?^? 

Eukaryotic cells are also generally much largerᵃ and
have more complex organization than bacterial cells.
They also have larger genomes,ᵇ especially in multicellular
organisms, split between linear chromosomes, with
variable but usually substantial amounts of “repetitive”
sequences (Chapters 5 and 10).

Eukaryotic cells have a membrane-bound nucleus
where the chromosomes are located, an important
consequence of which is the separation of transcription
from translationᶜ (Chapter 7). They also contain other
internal membranous structures and membrane-bound
“organelles”!! including caveolae (surface pits for
endocytosis of external material);? endosomes (which traffic
proteins, lipids and other components within the cell);!3
peroxisomes (where specialized oxidative metabolism
takes place);!* lysosomes (which degrade engulfed
particles and intracellular components);^-" the endoplasmic
reticulum (ER, the ‘rough’ form of which is studded with
ribosomes);*? the Golgi apparatus (a distribution
network wherein proteins are imported from the ER, tagged
with carbohydrates, sorted and packaged into
endosomal vesicles destined for lysosomes, the cell surface or
export,???! named after its discoverer Camillo Golgi in
18982); mitochondria (which generate energy by
oxidation of carbohydrates and fatty acids);? and (in plants and
algae) chloroplasts (photosynthetic energy capturing
factories that produce sugars, sometimes called ‘plastids’)**
(Figure 4.1). These organelles display an intricate degree
of interaction and coordination.?5?0


a There are exceptions.?

b Bacterial genomes have a maximum size of around 10 Mb,’ are
usually circular and replicated bidirectionally from a single origin of
replication, first shown by John Cairns in 1963.5 Bacteria can also
contain additional circular DNAs called plasmids, a term introduced
in 1952 by Joshua Lederberg to refer to "any extrachromosomal
hereditary determinant".? Plasmids often carry antibiotic resistance
genes or others that confer selective advantage and may replicate
autonomously or become integrated into the chromosome.

c Transcription and translation are coupled processes in bacteria.
Translational stalling can result in transcription termination, as in
the trp operon, where it is used to attenuate the production of
tryptophan biosynthetic enzymes, shown by Charles Yanofsky in the late
1970s.!°

Eukaryotic cells also have many non-membrane- 
bound compartments, such as the nucleolus (the site of 
ribosomal biogenesis within the nucleus) which was 
first observed in 1835.*! These compartments are phase- 
separated domains nucleated by RNAs and proteins 
containing intrinsically disordered regions (Chapter 16). 
Prokaryotic (bacterial and archaeal) cells have no inter- 
nal structures as obvious as those in eukaryotes, although 
there is spatial organization and compartmentaliza- 
tion.3234 Phase-separated domains occur in prokaryotes 
and may predate cellular life (Chapter 16). 

Recent advances in scanning electron microscopy 
and cryo-electron tomography have also enabled high- 
resolution imaging of cellular organelles and subcellular 
structures.%5-% 


THE ORIGIN OF CELLS 


The origin of life involved the evolution of macromol- 
ecules capable of transmitting information and catalyz- 
ing biosynthetic reactions, as well as their encapsulation 
and the harnessing of energy — the reversal of entropy 
to create ordered systems, as first argued by Schrödinger
in 1944.94! There are two leading hypotheses about 
where and how this might have occurred: in deep ocean 
hydrothermal vents where proton gradients could form, 
proposed by Michael Russell, Nick Lane, Crispin Little 
and colleagues;?-^6 or in terrestrial hot springs where 
hydrothermal pools undergo wet-dry cycles (reprising 
Darwin’s “warm little ponds”) that favor the synthesis
of organic polymers, lipids, peptides and nucleic acids, 
put forward by David Deamer, Martin Van Kranendonk, 
Armen Mulkidjanian, Eugene Koonin, Steve Benner 
and others, ^55 with evidence favoring the latter.4/49.50.56 
There is also evidence that the ubiquitous presence of 
ATP is due, in part, to its ability to aid protein solubility.*” 

In 1977, Carl Woeseᵈ and George Fox?? made the
unexpected discovery by ribosomal RNA gene sequencing
that there are not two but three distinctive domains of life
on Earth:% the unicellular Bacteria and the superficially
similar Archaea, collectively called Prokarya, and the
unicellular and multicellular Eukarya (protozoa, algae,
fungi, plants and animals).ᵉ


d Woese was also one of the originators of the RNA World Hypothesis
in 1967 (Chapter 9), although not by that name.* The 3-domain
theory took many years to be accepted, although the data was clear.

The eukaryotes — the last common ancestor of which 
is estimated to have emerged around two billion years 
ago% — appear to have arisen from an archaeal progeni- 
tor that fused with a bacterium, as they possess many 
nuclear genes derived from each, but those involved in 
the core processes of DNA replication, transcription 
and translation, including the histones that are used to 
package eukaryotic chromatin (Chapter 14), are clearly 
archaeal in origin.*74 Whether there are two (original) 
or (now) three primary branches of life! is a moot and 
almost semantic point.’ 


e The terms ‘eukaryote’ and ‘prokaryote’ were first coined by Édouard
Chatton in 1925, and the distinction formalized in 1962 by Roger
Stanier and C. B. (Cornelius) van Niel, based on the presence or
absence of a nucleus.°!

f It has also been suggested that the giant viruses (Megavirales), dis- 
covered in amoebae in 2002, comprise a fourth super-kingdom of 
life. 


FIGURE 4.1 Comparison of prokaryotic and eukaryotic cell structures. (Original illustration by Heidi Cartwright.)


Mitochondria and chloroplasts are also descended 
from bacterial ancestors? captured by endosymbiosis,2+% 
proposed controversially by Lynn Margulis (Sagan) 
in 19678! but later confirmed by Robert Schwartz and 
Margaret Dayhoff.? Mitochondria and chloroplasts con- 
tain remnant small circular genomes$083-86 and bacterial- 
like translation systems for key hydrophobic proteins that 
must be made in situ.5^5* 

A plausible theory advanced by Tom Cavalier-Smith is 
that eukaryotes initially made their living as cellular scav- 
engers and predators (think amoebae), which required the 
development of a flexible external membrane for phago- 
cytosis.5%% It also required internal membranes for the 
protection of the genome and compartmentalization of 
lysosomes and other organelles, which is consistent with 
the flexible membranes and microvesicles observed in the 
extant lineage of the proposed archaeal ancestor of the 
eukaryotes.” 


g Mitochondria are descended from a hydrogen-producing
α-proteobacterium;”””8 chloroplasts from a cyanobacterium.7%80




GENETIC RECOMBINATION 


Whereas prokaryotes have only one genome copy (in
addition to self-replicating extrachromosomal plasmids),
eukaryotic cells are (usually) ‘diploid’, having obtained
one nuclear genome copy from each parentⁱ via the
‘haploid’ ‘gametes’ (sperm and ova) produced by meiosis.
Diploidy and the elaboration of two sexes in eukaryotes
may have arisen (although there are many possible
explanations?^55) to allow recombinational exchange of
larger genomes in complex cells and especially between
multicellular organisms.

In (most) eukaryotes, having two genome copies 
means having two copies of each of the genes therein, 
which are referred to as ‘alleles’ if they differ in any 
demonstrable way. These alleles may be dominant or 
recessive with respect to the other in their impact on phe- 
notype, as demonstrated by Mendel. This arrangement 
provides the advantage that defective alleles can be toler- 
ated, as having one functional version is usually enough 
to compensate, which enables flexibility. Regulatory 
variation may be co-dominant, which allows more com- 
plex dynamics in the evolution and expression of quanti- 
tative traits. 

DNA exchange occurs ad hoc in prokaryotes and at
meiosis in eukaryotes, where it involves chromosomal
pairing, formation of a 4-stranded cruciform structure
between homologous DNA duplexesˡ (‘crossing-over’)
and strand exchange.!% Homologous recombination was
the basis of genetic mapping by Benzer, Morgan and others in
the first half of the 20th century (Chapter 2), and later
used to track down the protein-coding genes that are
damaged in disorders such as cystic fibrosis (Chapter 11).


h Some are ‘polyploid’, such as wheat, which has six copies of each
chromosome (hexaploid), and whose gametes have three.

i Mitochondrial genomes are maternally transmitted in ova, the
source in part of the cytoplasmic inheritance of some characteristics
first observed by the Boveris around 1900 (Chapter 2). Chloroplast
genomes are usually also maternally inherited.?!

j Linear chromosomes may be required for meiotic segregation.”
They are copied via multiple replication origins and an overlapping
series of bidirectional replication bubbles, a highly complex process
that is tightly controlled during differentiation and development”
(Chapter 15).

k Some defective genes can be dominant for mechanistic reasons,
such as those encoding proteins involved in multi-component
complexes, because only one copy is expressed (as in parental
imprinting; Chapter 5) or because they only occur on sex chromosomes,
as exemplified by the higher frequency of color blindness in males
(the genes are located on the X chromosome). Some ‘heterozygous’
combinations of functional and defective alleles produce intermediate
effects, called ‘haploinsufficiency’, which may be more common
than appreciated and have benefits in some circumstances.

l Called a ‘Holliday junction’, after Robin Holliday who proposed it in
1964.98,99

Recombinational exchange is an evolvability strategy 
that appears to have arisen at the dawn of life to enable 
genetic variations to be separated and discriminated by 
selection.?! Gene assortment and fault tolerance were 
major considerations in the development of evolutionary 
theory and mathematical models of population genetics. 

The evolution of the pathways and infrastructure for 
genetic recombination is an example of second-order 
Darwinian selection, where an accidental innovation has 
no particular or immediate phenotypic consequence but 
confers long-term advantages. There are almost certainly 
other evolutionary search optimization strategies that 
have not yet been recognized, because of the emphasis on 
phenotypic selection and the belief that mutation occurs 
randomly (Chapter 18). 

It is also worth noting the growing appreciation of 
transposons (Chapters 5 and 10) and viruses,ᵐ the most
abundant biological entities on Earth (which may have 
predated cellular life), as wider currencies for genetic 
exchange and dissemination, with central roles in the 
early evolution of cells, the ‘invention’ of DNA and DNA 
replication, the formation of the three domains of life and 
the diversification of multicellular organisms.105-110 


THE EMERGENCE OF COMPLEX ORGANISMS 


Multicellular plants appeared around 1-1.2 billion years 
ago, initially in the oceans, then in freshwater environ- 
ments.!!-!? They colonized the land around 850 million 
years (Myr) ago!? and diversified into more complex vas- 
cular forms with roots, leaves and seeds between 500 and 
200 Myr ago, when the angiosperms (flowering plants) 
emerged following an ancient genome duplication.!!4!!5 
The phylogenetic tree from algae to angiosperms is now 
being constructed from DNA and RNA sequence data.!!6 
Interestingly, improvements in the efficiency of CO₂
fixation, known as the C4 pathway, appeared just 25-32 Myr
ago, likely as an adaptation to declining CO₂
concentrations due to biological sequestration into calcium
carbonate and carbonaceous deposits.!!”

Sponges, the most primitive of the animal phyla,
existed around 890 Myr ago. The first animals with
complex body plans, the Ediacaran fauna,ⁿ appear in the
fossil record between 620 and 550 Myr ago in an
evolutionary radiation called the Avalon explosion,!% with
antecedents up to 800 Myr ago.!!! The Ediacaran fauna
were soft-bodied organisms, ranging in size from 1 cm
to over 1 m, likely making a living as scavengers, like
fungi, with which animals share a common ancestor.'?!-?*
Ediacarans had radial or bilateral symmetry and
segmented tube-, quilt- or frond-like structures, some with
similarities to modern worms and jellyfish,?5-? which
may be their descendants.!30-133


m Norton Zinder and Joshua Lederberg showed in 1952 that
bacterial viruses can integrate into the bacterial genome,'” providing
an explanation for the lysogeny phenomenon earlier described by
Eugene Wollman and André Lwoff, which became an important
genetic tool for gene mapping.'? Animal retroviruses were described
by multiple groups in the 1960s and 1970s.19

The Ediacarans were largely supplanted around 540-500
Myr ago by a second and more spectacular large-scale
radiation. It is known as the Cambrian explosion, first
revealed by fossils in the Burgess Shale of the Canadian
Rocky Mountains, discovered by Richard McConnell in
1886 and characterized in detail in the early 20th
century by Charles Walcott and by others since, notably Simon
Conway Morris, in many Cambrian “Lagerstätten”.
In one stratum of rock, and an estimated time window of
~10-20 Myr, recognizable ancestors of all extant
metazoan phyla appear, including arthropods and chordates,
with hard skeletons, advanced predation and locomotive
capacity, along with other bizarre forms!?-? (Figure
4.2), likely driven by the evolution of macrophagy.'^?

Soon after, fish were swimming in the oceans." 
Similar rapid phenotypic diversifications also occurred 
after later mass extinction events," including that follow- 
ing the meteorite strike in the Gulf of Mexico 66 Myr ago, 
which wiped out the non-avian dinosaurs and allowed the 
rise of mammals into vacated ecological niches.!43144 

The initial appearance and rapid evolution of animals 
has commonly been thought to have been potentiated by 
the increase in atmospheric oxygen from photosynthe- 
sis and the advantages of aerobic energy generation by 
mitochondrial electron transportᵒ in eukaryotes.??.146-14$
However, there was substantial atmospheric oxygen long 
before the evolution of animals,!%15 and it seems that 
sufficient oxygen may have been an enabler but not a 
direct cause of their emergence.!*1152 Rather, the transi- 
tion from unicellular to developmentally complex organ- 
isms with highly organized assemblages of specialized 
cell types was likely achieved by advances in genome 
organization and regulatory systems (Chapters 14—16). 
Later transitions, undoubtedly also requiring genetic
innovations, occurred in the colonization of the land,
to enable physiological adaptability to a more variable
environment and more complex structures for terrestrial
mobility.


n Named after the Australian site where they were found in abundance
by Reginald Sprigg in 1947.!?

o There are conflicting views.!^


CHROMATIN 


The most obvious molecular genetic difference between 
prokaryotes and eukaryotes is that the much larger 
genomes of the latter are not only sequestered in a nucleus 
but also segmented into chromosomes and packaged into 
complex chromatin structures. 

Eukaryotic chromatin is not homogeneous, and
contains regions with different properties, broadly divided into
open ‘euchromatin’ (gene rich, transcriptionally active,
lightly stained) and compacted ‘heterochromatin’
(transcriptionally quieter, densely stained), first documented in
the late 1920s by Emil Heitz,'% a pioneer of
cytogenetics, ^^ notably in the giant ‘polyploid’ or ‘polytene’
chromosomesᵖ in the salivary glands of insects.!57-159

Dynamic changes in chromatin were observed in the 
appearance and disappearance at different developmen- 
tal stages of “facultative” heterochromatin by Heitz and 
others,!5 and ‘puffs’ in polytene chromosomes formed 
by localized decondensation of small chromosomal 
segments, described by Donald Poulson and Charles 
Metz in 1938 (Figure 4.3), and by others in the 1950s 
and 1960s.!%-162 Puffs exhibit developmental stage- and 
tissue-specific patterns, 6?/9^ can be induced by heat 
shock! and hormones such as ecdysone,!% and are sites 
of RNA synthesis.!%-172 In 1973, it was shown by Adolf 
and Monika Graessman that puff induction involves 
RNA,” and later by Subhash Lakhotia and colleagues 
that the product of a puff induced by heat shock is an 
RNA that does not encode a protein"^!? (Chapter 9). 

Specific heterochromatic regions of eukaryotic chro- 
mosomes form centromeres,"9!7 which act as organiz- 
ing centers of cell division, described first by Edouard 
van Beneden and then by Boveri (who coined the name) 
in the 1870s and 1880s.7*.7? Centromeres contain inter- 
nal granules (called ‘centrioles’ by Boveri) and attach to 
kinetochores for spindle formation and chromatid pair- 
ing and separation to daughter cells during mitosis and 
meiosis'®° (see Chapter 15). 


p Polytene chromosomes were first observed by Édouard-Gérard
Balbiani in 1881, who is remembered in the term ‘Balbiani ring’,
which refers to the large chromosomal puffs where transcription
occurs. They contain multiple DNA molecules in parallel,
generated by multiple rounds of DNA replication without an intervening
cell division. Their banded structure corresponds to topologically
associated domains (Chapter 14), which are preserved between
polytene and diploid cells.!*
FIGURE 4.2 Cambrian fossils from the Burgess Shale, including ancestral arthropods (a-d, h-j, l-s), primitive chordates (e, f),
annelid (g) and mollusk (k). (Reprinted from Caron et al.!* by permission of Springer Nature.)
FIGURE 4.3 Micrographs of banded polytene chromosomes; arrows indicate ‘puffs’. (Reprinted from Poulson and Metz !9? by
permission of John Wiley and Sons.)


In 1959, Susumu Ohno showed that one of the two 
X-chromosomes in female mammals is heterochromatic!*! 
(called the ‘nucleolar satellite’ and later the ‘Barr body’ after
its discoverer, Murray Barr!??). In 1961, Mary Lyon
demonstrated that X-chromosome inactivation occurs
randomly in early embryogenesis:!5*/*^ females are mosaics
of active X-chromosomes inherited from either parent,‘ 
a ‘dosage compensation’ mechanism to equalize with
males, who only have one X-chromosome - a traditional
system in genetic and cytological studiesʳ (Chapter 2)
later shown to be controlled by RNAˢ (Chapter 9). It was
also known that in some insects one entire set of chro- 
mosomes becomes heterochromatic during male early 
embryonic development,5? and that chromosomes in 
embryonic cells often have a different morphology than 
those in adult cells.!?? 

The existence of “facultative” heterochromatin, position
effect variegation, chromosomal puffs and ‘lampbrush
chromosomes’ (described by Walther Flemming in
188212), which occur in the oocytes of all animals except,
curiously, mammals,?* suggested that there are
higher-order genomic arrangements and additional modes of gene
regulation during plant and animal development.'*?


q The mosaic pattern of X-chromosome inactivation can be observed
in variegated coat colors, such as in ‘tortoiseshell’ cats, which are
almost invariably female (except when chromosomal aberrations
such as XXY occur), and in the mosaic pattern of sweat glands in
women.

r X-linked (sex-linked) traits have also been of great value to
human genetics (Chapter 11), given that recessive mutations are
exposed in males, classic examples being red-green color-blindness
and Duchenne muscular dystrophy,!% with variable intermediate
phenotypes (severity of effect) in females because of mosaic
expression.!*7

s In Drosophila, dosage compensation is achieved not by
inactivation of one of the X-chromosomes in females, but by global
upregulation of the activity of the single X-chromosome in males,'**
which also centrally involves non-coding RNAs (Chapter 9).

It was also found that eukaryotic DNA is wrapped
around proteins called histones,ᵘ like cotton around a
spool, in a repeating structure of units called ‘nucleosomes’.?*
Histones were identified by Kossel in 1894,% but it was 
another 80 years before nucleosomes were visualized in 
the electron microscope by Ada and Donald Olins200-202 
(Figure 4.4) and their octameric histone complement 
defined by Roger Kornberg and colleagues.20320% It was 
even longer before it became evident that histones are the 
major repositories of epigenetic information (Chapter 14), 
although hints of a role in gene regulation were emerging. 

In 1950, Ellen and Edgar Stedmanᵛ proposed that
histones are repressors that could inactivate genes in a
tissue-specific manner, based on the quantities of
histones in growing and non-growing tissues.2052% Their
proposal was supported by work of Ru-chih Huang and
James Bonner, and Vincent Allfrey, Alfred Mirsky and
colleagues, who found that histones inhibit the
transcription of DNA in vitro.?97208


t In the late 1950s, electron microscopic visualization of elongating
transcripts on lampbrush chromosomes first suggested rapid packaging of
nascent RNAs with proteins.!”! In the 1970s, several laboratories began
focusing on biochemical purification and compositional/structural
analysis of non-ribosomal ribonucleoproteins, which led to identification of
mRNA cap-binding proteins, polyA-binding protein, pre-mRNA
splicing proteins (Chapter 8) and mRNA transport proteins, among others.'??

u Eukaryotic histones have ancestral orthologs in archaea, 7. where they
regulate gene activity in response to environmental circumstances.!%
Histones have been shown to have copper reductase activity,'?
suggesting their original role was to protect against oxygen toxicity.'??

v The Stedmans had earlier suggested that non-histone chromosomal
proteins, which they called “chromosomins”, represent the “basis
of inheritance” and are also involved in gene regulation, predicting
that the physical association of chromosomins and nucleic acids was
required for synthesis of specific proteins.2%


FIGURE 4.4 Electron micrographs taken by Donald and Ada Olins of (a) and (b, size marker 30 nm) low ionic strength
chromatin spreads showing the ‘beads on a string’; (c) nucleosomes derived from nuclease-digested chromatin (size marker 10 nm);
(d) chromatin spread at a moderate ionic strength showing a 30 nm higher-order structure (size marker 50 nm). (Reproduced from
Olins and Olins??!29 with permission of American Scientist (a) and Springer Nature (b-d).)

However, histones, superficially at least, displayed uni- 
formity between tissues and species, as well as between 
‘repressed’ and active chromatin (see below), which led 
John Frenster, Allfrey and Mirsky to conclude in 1963 
that they were unlikely to be gene-specific regulators. 
For decades thereafter, nucleosomes were considered pri- 
marily a mechanism for compacting large genomes,?!0-212 
given the widespread conviction that transcription factors 
are the primary means of gene regulation. 

In the mid-1960s, Allfrey and Mirsky proposed that 
post-translational modifications of histones (acetylation 
and methylation) have regulatory functions.21+215 They 
showed that lymphocyte activation triggers massive acet- 
ylation of chromatin and that histone acetylation also 
occurs in insects. A decade on, DNA methylation was 
also suggested as a mechanism to regulate gene activ- 
ity,218,219 although these ideas would only be tested and 
confirmed much later?20-22 (Chapter 14). 

Beyond this, how the structure of chromatin was orga- 
nized and how it affected gene expression in eukaryotes 
was unknown; progress was slow because of the sheer 
size and complexity of the genomes and chromosomes, 
and the difficulties of working with all but unicellular 
models such as yeast. 


CHROMATIN-ASSOCIATED RNAs 


By this time, it was also clear that RNA is the third 
component of chromatin. In his early histological exper- 
iments on the distribution of DNA and RNA, Brachet 
showed that RNA is present not only in the cyto- 
plasm but also in chromatin, subsequently found 
by Mirsky and Hans Ris to reside in a NaCl-insoluble 
fraction, which comprised only ~10% of the total but 
retained all the usual features of chromatin. RNA was 
also reported to remain attached to chromosomes dur- 
ing cell division. 

Indeed, while largely overlooked during the heady 
days of the genetic code, a number of publications in the 
1960s and early 1970s reported the presence of RNA in 
chromatin fractions, some of which were proposed 
to be structural and regulatory agents. These included 
Frenster’s 1965 model of “De-repressor RNAs”, based on 
the observation that RNA added to heterochromatin frac- 
tions increased the level of transcription, which was most 
pronounced when nuclear RNAs were added (compared 
to cytoplasmic and non-specific RNAs such as rRNA 
or yeast RNA), posited to involve RNA hybridization to 
complementary sequences in the repressed DNA. 

In 1965 also, Huang and Bonner reported the pres- 
ence of low molecular weight RNAs in chromatin. These 
“chromosomal RNAs” (cRNAs) were protected from 
RNase degradation and corresponded to ~8% of the total 
nucleic acid mass present in nucleohistones. This was 
the first in a series of reports of the existence of tissue- 
specific short RNAs in chromatin in plants and animals 
that associate with non-histone chromatin proteins and 
can hybridize to homologous DNA, leading to the 
hypothesis that cRNAs had a role in regulating gene 
expression.235,237-239 

These short RNAs are not precursors for any cyto- 
plasmic product and some were distinguished by a 
high content of methylated or dihydropyrimidine nucleo- 
tides,73%240241 a signature of small nucleolar and small 
spliceosomal RNAs (Chapter 8). 

Interestingly, cRNAs were found to hybridize exten- 
sively to “middle repetitive” DNA sequences, which 
Bonner proffered as evidence that repetitive sequences 
may be regulatory elements.237,242,243 These observations 
would have an impact on models of genome regulation in 
the higher organisms (Chapter 5), but were sidelined later 
by the widespread assumption that much of the genomes 
of higher organisms is junk, partly and ironically 
because they contained so many “repetitive” sequences 
(Chapters 7 and 10). 

Soon after the Frenster and Bonner publications, 
William Benjamin and colleagues reported that RNA 
isolated from a rat liver nucleoprotein fraction co-sedi- 
mented with histones. This RNA had high adenine and 
uridine content and had heterogeneous sizes by sucrose 
gradient analysis, adding to the complexity of the types 
of RNAs of unknown functions found in the eukaryotic 
nucleus. Although recognizing that these RNAs might 
represent an intermediate in the synthesis of mRNA, 
Benjamin et al. also speculated that these RNAs could 
play a role in the control of gene expression, invoking 
Paul Sypherd and Norman Strauss’ 1963 suggestion that 
regulatory systems involve both protein and RNA, 
such that associated RNAs might confer specificity to 
repressive histones by base-pairing with the target gene 
(Chapter 16). 

During the 1970s, nuclear RNAs began to be better 
characterized. Some were visualized with chromatin at 
specific stages of the cell cycle and proposed to act as 
“programmers” of “chromosomal information” and gene 
regulation. In vitro experiments by Takeharu Kanehisa 
and colleagues with purified chromatin indicated that 
specific short chromatin-associated RNAs could “mod- 
ify” chromatin structure and stimulate RNA synthesis, 
particularly in chromatin isolated from the same tissue, 
suggesting a tissue-specific effect.247-24 

In 1973, Isaac Bekhor showed that “chromosomal 
RNA-protein complexes” can interact with DNA in vitro 
and increase its melting temperature, indicating a stabi- 
lizing effect, leading him to postulate that cRNAs pres- 
ent in chromosomal RNA-protein complexes constituted 
a structural component of chromatin, rather than regula- 
tory molecules, an idea favored by other studies report- 
ing RNAs associated with heterochromatin. In 1978, 
Sheldon Penman's group showed that stable species of 
high molecular weight RNAs also associate with nuclear 
complexes, from which it was again hypothesized that 
“RNA networks” had structural roles in the nucleus. 

Thoru Pederson and Jaswant Bhorjee reported in 1979 
that three short RNAs are associated with chromatin and 
favored the hypothesis that these “DNA-linked RNAs” 
are involved in the control of the tertiary structure of 
chromatin. These RNAs were ~130-200 nt long, 
highly abundant, relatively stable, and were designated 
small nuclear RNAs D, C and G (later small nuclear 
RNAs U1, U2 and U5, respectively, Chapter 8). They 
showed that only a fraction (<10%) of these RNAs is 
associated with chromatin, supporting the earlier results 
of Mirsky and Ris, while the remainder was nucleoplas- 
mic. Others reported that separated fractions contained 
between 6 and 11 size classes of small RNAs, most of 
which seemed to be reversibly bound to chromatin pro- 
teins,233,254,255 but some of which did not dissociate in 
high salt concentrations, possibly reflecting RNA-DNA 
hybrids. 

Later studies using hybridization and cytogenetic 
techniques showed that cRNAs from human placenta 
displayed a widespread pattern of hybridization to meta- 
phase chromosomes, preferentially in telomeric regions 
and heterochromatic short arms of acrocentric chromo- 
somes, as well as regions with a high content of repetitive 
DNA. Chromatin fractionation studies using Frenster's 
method had also indicated that heterochromatin contains 
a more stable fraction of chromatin-associated RNAs 
compared to euchromatin.??! 

Other studies focused on the effects of high molecular 
weight RNAs and nascent transcripts in chromatin and in 
other nuclear structures. Jean-Pierre Bachellerie and col- 
leagues used subnuclear fractionations, autoradiographic 
and ultrastructural techniques to show that “perichro- 
matin fibrils represent the morphological state of newly 
formed heterogeneous nuclear RNA (hnRNA)”.758 

Much of this would make sense later (Chapters 7, 8 
and 16), but at the time there was a fierce debate over the 
existence, biological relevance and specificity of chro- 
matin-associated RNAs. The concerns included the 
reproducibility of the findings, the questionable purity of 
cellular and chromatin fractions (complicated by a pleth- 
ora of fractionation procedures), the presence of nucle- 
ases that could lead to contamination with degradation 
products of tRNAs, rRNAs and heterogeneous nuclear 
RNAs (see below), and the feeling that the models pro- 
posed were too speculative.25%-264 

On the other hand, a number of lines of evidence were 
proffered against the criticisms, including that cRNAs 
have characteristic elution properties, their complexity 
and hybridization kinetics differed from common RNAs 
such as tRNA and rRNA, they had a different stability, 
and chromatin preparations using stringent methods con- 
tained a reproducible RNA fraction.235,236,239,243,256,265,266 

By the end of the 1970s, different groups that had ana- 
lyzed the composition of total chromatin estimated that 
the RNA/DNA ratio of the chromatin was approximately 
0.05-0.2 in different eukaryotes, and that distinct 
classes of RNAs with specific properties were chroma- 
tin-associated, including species of varying stability and 
molecular weight,255256267269 which would include new 
classes of infrastructural RNAs involved in rRNA bio- 
genesis and splicing (Chapter 8). 

Nevertheless, uncertainty about chromatin-associated 
RNA remained because the characteristics of the 
reported RNA profiles varied by tissue and the method of 
chromatin isolation. The exact nature of these RNAs was 
unknown, given that the methods of identification were 
rudimentary at that time. It was a complicated muddle. 


EARLY MODELS OF RNAs IN 
NUCLEAR ARCHITECTURE 


There was also emerging evidence that RNA is involved 
in the organization of the nuclear ‘matrix’, a fibrous struc- 
ture first reported in 1948, although its composition, 
stability and function have been the subject of ongoing 
conjecture. 

Penman recognized that chromosomes are not ran- 
domly distributed in the nucleus, much later described 
in detail (Chapter 14), and that the nuclear matrix played 
a central role in its three-dimensional architecture. His 
group used a chromatin-depletion strategy to show that 
ribonucleoprotein networks extend throughout a nuclear 
structural lattice and that the integrity of nuclear and 
chromatin architecture is dependent on RNA, as indi- 
cated by the treatment of cells with transcription inhibi- 
tors or of extracted nuclei with RNase.* Penman concluded 
that RNA is not only a structural component of the 
nuclear matrix (an “RNA-dependent nuclear matrix”) but 
also an organizer of higher-order structures of chromatin 
(“architectural RNA”),27327 an idea that would reemerge 
powerfully much later when it was discovered that RNA 
and transcription modulate chromatin territories and 
nucleate subcellular domains (Chapter 16). 

* RNase treatment had been used previously (in 1974) to show that 
RNA is involved in the maintenance of condensation of dinoflagel- 
late chromosomes. 

Similar ideas were elaborated by others, including the 
‘Unified Matrix Hypothesis’ postulated in 1989 by Klaus 
Scherrer, in which the transcribed and non-transcribed 
part of non-coding DNA would have a direct morpho- 
genic function. According to this hypothesis, these 
regions have an intrinsic role in “the tridimensional 
network of chromatin and nuclear topological organi- 
zation”. This was posited to explain phenomena such 
as the ‘Chromosome Fields’ with co-localization of 
linked genetic loci within chromosome regions and the 
specificity of sites of chromosome recombination in 
cancer. Scherrer also suggested that RNA processing 
may play a role in nuclear architecture by organizing the 
selective transport and control of individual transcripts, 
which would then act as signals for specific proteins and 
in combination define the nuclear matrix. 


HETEROGENEOUS NUCLEAR RNA 


Radioactive labeling kinetics, sucrose gradient sedimen- 
tation and hybridization studies in the early 1960s by 
Scherrer, James Darnell, Georgii Georgiev and colleagues 
indicated that rapidly labeled transcripts were formed in 
the nucleus of mammalian cells, which exhibited high 
molecular weight and heterogeneous sedimentation 
profiles (Figure 4.5), AU-rich composition and unstable 
character.?772% These unexpected ‘giant’ nuclear tran- 
scripts were only broadly defined and described variously 
as DNA-like RNA (‘dRNA’), nascent messenger-like 
RNA (nascent ‘mlRNAs’) and heterogeneous (or het- 
erodisperse) nuclear RNA (‘hnRNA’). 

In 1963, Scherrer and Darnell showed that ribosomal 
RNAs in human cells are initially produced as large pre- 
cursor molecules and subsequently processed into mature 
rRNAs,22283 confirmed by Penman and Giuseppe and 
Barbara Attardi,?84285 and shown to take place in the 
nucleolus.256257 This may have been an idiosyncratic fea- 
ture of the ribosomal operons, but they also reported that 
other “giant” RNAs with “messenger RNA properties” 
exist in the nucleus.?83,284,288 

The biological significance of the large heterogeneous 
transcripts was puzzling. The difference in base com- 
position between hnRNA and cytoplasmic mRNA, as 
well as differences in their half-lives, did not suggest a 
simple relationship. Henry Harris noted in 1965 that 
"only a small proportion of the RNA made in the nucleus 
of animal and higher plant cells serves as a template 
for the synthesis of protein", and considered that “most 
of the nuclear RNA, however, is made on parts of the 
DNA which do not contain information for the synthe- 
sis of specific proteins. This RNA does not assume the 
configuration necessary for protection from degrada- 
tion and is eliminated” (quoted in 2%). Harris later noted 
that “pulse-labeled RNA was almost universally misdi- 
agnosed as messenger RNA” and that other suggestions 
were considered profoundly heretical at the time. 

FIGURE 4.5 Sucrose gradient profiles of pulse-labeled DNase-treated nuclear and cytoplasmic preparations before and fol- 
lowing actinomycin D treatment (which blocks RNA synthesis) showing the presence of high molecular weight RNAs in the 
nucleus but not the cytoplasm. Open circles, radioactivity; continuous line, optical density at 260 nm, which detects nucleic acids. 
(Reproduced from Penman with permission of Elsevier.) 

In 1966, Scherrer was the first to propose that 
hnRNAs are precursors of mRNAs but, despite his 
intense efforts, his proposal was not widely enter- 
tained. In 1968, Penman showed that “heterogeneous 
nucleoplasmic RNA” is produced in the absence of ribo- 
somal RNA synthesis (blocked by specific concentra- 
tions of actinomycin D) and is turned over rapidly, with 
a mean life of approximately 1 hour, although he did 
not think that this might be a precursor to mRNA.292,293 
Penman also showed that the size of hnRNA increased 
with increased genome size, all of which is consistent 
with the later discovery of introns and large pre-mRNA 
primary transcripts that are spliced to produce mRNAs 
(Chapter 7). 

Hybridization studies by Allfrey and Mirsky indi- 
cated that while ~80% of the DNA was not accessible 
for transcription, transcription products covered up to 
20% of the DNA in a given mammalian cell. This 
and similar observations were potentially relevant for 
gene regulation because, given that the majority of the 
genome DNA was inactive in a particular cell type, it 
implied a mechanism of genome repression and allowed 
the suggestion of specialization of chromosomal regions 
for the control of growth and development of differenti- 
ated tissues.205,295 

In 1967, Ruth Shearer and Brian McCarthy showed 
that while all the sequences of cytoplasmic RNA were 
present in the nucleus, the latter contains a much greater 
fraction of sequences that are not exported to the cyto- 
plasm, but rather are rapidly turned over, confirmed by 
others.??7295 They went on to say: “The existence of RNA 
molecules specific to the nucleus suggests a role as medi- 
ators of the regulation of gene transcription ... although 
these functions are entirely speculative, the finding that 
the majority of the active genome codes for short-lived 
RNA molecules which are restricted to the nucleus opens 
up exciting possibilities for the study of the regulation of 
gene action in mammals.” 


HEROES OR FOOLS? 


Some working in animal genetics at the time, such as 
Ed Lewis (Chapter 5), drew the obvious conclusion 
that “the ‘genome’ of phage and bacteria may be struc- 
turally organized in manner different from the chromo- 
somes of higher forms”. However, not much notice 
was taken. The lac operon dominated the models of 
gene organization and the regulation of gene expression 
in eukaryotes; it was the basis of the ruminations by 
Jacob and Monod, and others such as Ernst Mayr, 
on “genetic programmes” and gene networks (or “nets of 
interacting genes”), in what developed into the belief that 
the combinatorial action of “transcription factors” is suf- 
ficient to execute complex developmental programs302-304 
(Chapter 15). 

In a 1969 paper entitled “On the Structural 
Organization of Operon and the Regulation of RNA 
Synthesis in Animal Cells”, Georgiev assumed that 
the principles of regulation of transcription defined 
in bacteria are retained in multicellular organisms. 
He defined an operon as an “elementary unit of tran- 
scription” and proposed that operons in higher organ- 
isms consisted of a promoter-proximal regulatory 
“zone” and a structural zone that contained the coding 
sequences of several mRNAs with related functions, as 
in the lac operon. In his view, the entire operon would 
be transcribed as a “giant D-RNA”, with regulatory 
sequences at the 5' end degraded in the nucleus and the 
mRNA transferred into the cytoplasm. Being aware 
of Roy Britten's findings regarding the abundance of 
repeat regions in the genome, he also posited that 
repetitive sequences could be present in many oper- 
ons and be targets for regulatory proteins, thereby 
assigning a functional role for the repetitive sequences 
scattered throughout the genome, similar to that pro- 
posed in the same year by Britten and Eric Davidson 
(Chapter 5). 

Scherrer also proposed (in 1968) that hnRNAs are 
polycistronic precursors of mRNAs and are subject to 
both transcriptional and post-transcriptional regula- 
tion, the “cascade regulation” model. Accordingly, 
a polycistronic hnRNA may be cleaved sequentially 
in the nucleus or in the cytoplasm to generate multi- 
ple mRNAs in a regulated fashion. Indeed, this logic 
assumed that “as a consequence of the central dogma, 
postulating that gene activity leads to phenotypic 
expression of genetic information through the media- 
tion of mRNA, localized genes related to particular 
phenotypic characteristics should produce mRNA”. 

In the following years many observations pointed 
to a relationship between hnRNA and mRNA. In 
1971, Darnell, George Brawerman, Mary Edmonds 
and colleagues found that eukaryotic mRNAs* and 
hnRNAs both contain an extended sequence of adenines 
(‘polyA tails’) at their 3' end, confirmed by oth- 
ers,16318 and that eukaryotic mRNAs are derived from 
longer precursors. In 1974, Kin-Ichiro Miura, 
Yasuhiro Furuichi, Aaron Shatkin, Fritz Rottman and 
colleagues showed that eukaryotic mRNAs and pre- 
mRNAs both also contain an inverted modified (meth- 
ylated) nucleotide (‘cap’) structure at their 5' end32432 
(m7G, later shown to play a role in RNA splicing and 
translational control?953), also confirmed, along 
with a contemporary report of widespread methylation of 
mRNA (Chapter 17). 

* Prokaryotic mRNAs can also contain a shorter polyA sequence at 
their 3' end. 

These findings supported the conclusion that hnRNAs 
are, in fact, precursors to mRNAs, with a suggestion 
that mRNA might be comprised of fragments from each 
end of the hnRNA,?99?!8 but the idea that hnRNAs are 
mRNA precursors was only accepted and understood 
after the discovery of introns and splicing in 1977776 
(Chapter 7). 

On the other hand, it was gradually recognized by the 
late 1960s that the multigenic operon was an unlikely 
mode of eukaryotic genome organization and regulation 
since, for example, related genes such as those encoding 
alpha and beta hemoglobins or the enzymes for galactose 
metabolism were not co-localized in the genome, and 
the length of hnRNAs was much larger than that which 
would be expected for reasonably sized polycistronic 
mRNAs. 

Because the relation of hnRNAs and mRNA 
was not yet established, Scherrer and Lise Marcaud 
contemplated that hnRNAs might contain sequences 
“other than those carrying structural cistrons”, such as 
regions for interaction with proteins through secondary 
and tertiary structures, thus conferring specificity to 
the “functional RNA”. Alternatively, it was suggested 
that these RNAs might have independent regulatory 
roles, such as interacting with other regulatory mole- 
cules (inducers and repressors) or regulating allosteric 
proteins. 

The existence of 3' polyA tails in mRNAs was exploited and remains 
widely used in cloning and sequencing protocols (Chapter 6), to sep- 
arate mRNA from the large amounts of ribosomal, transfer and other 
RNAs in cells by its hybridization to oligomers of U or T (‘oligo dT’) 
affixed to solid or colloidal surfaces. 

Later the 5' cap would also be exploited for mRNA purifica- 
tion, although it turned out that many other transcripts that do not 
encode proteins are also capped and polyadenylated (Chapter 13). 
Moreover, while widely overlooked, a large fraction of the RNAs 
detected in human cells are not polyadenylated, although they are 
often capped.?!0315 

In fact, there are at least 25 different types of 5' caps in eukaryotic 
cells, at least some of which are cell- and tissue-specific, with roles in 
the initiation of protein synthesis, protection from exonuclease cleav- 
age and as identifiers for recruiting protein factors for pre-mRNA 
splicing, polyadenylation and nuclear export.???525 

In one way or another, many of these speculations 
would prove true, but the technology of the time was still 
too limited to test them, and animal and plant systems 
were much too complicated. At that time only a hand- 
ful of laboratories tried, valiantly, to understand genetic 
information and gene expression in eukaryotes. As 
recalled by Scherrer, 


only a few investigators were interested in the 
molecular biology of animal cells; the “serious” 
research was with E. coli and bacteriophages. 
James Watson visited MIT frequently, and would 
discuss our strange results. One day, he told me 
“To work with animal cells, you've got to be a 
hero or a fool!”276 

The genomes of ‘higher organisms’ are thousands of 
times larger than that of E. coli. Viral genomes range 
from ~1 kb to 2.5 Mb. Bacterial and archaeal genomes 
range from ~300 kb to 11 Mb (the upper limit, apparently, 
see Chapter 15), protozoan and fungal genomes from ~8 
to 100 Mb (most less than 50 Mb),* animal genomes from 
~100 Mb to 5 Gb (average 1.3 Gb) and plant genomes 
from ~100 Mb to 20 Gb (average 1.5 Gb). By compari- 
son, the combined length of the two largest human genes, 
dystrophin (2.4 Mb; required for muscle integrity) and 
the neurexin CNTNAP2 (2.3 Mb; which functions in the 
nervous system), is approximately the same as the entire 
E. coli genome. 
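
The size comparison above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two gene lengths are those quoted in the text, whereas the E. coli genome size of about 4.6 Mb is an assumption that this passage does not state, and the function and variable names are hypothetical.

```python
# Gene lengths quoted in the text, in megabases (Mb).
DYSTROPHIN_MB = 2.4
CNTNAP2_MB = 2.3

# Assumption (not stated in this passage): the E. coli genome is about 4.6 Mb.
E_COLI_GENOME_MB = 4.6


def combined_length_mb(*gene_lengths_mb: float) -> float:
    """Return the summed length of the given genes, in Mb."""
    return sum(gene_lengths_mb)


def ratio_to_genome(length_mb: float, genome_mb: float) -> float:
    """Return how many genome-equivalents a given length represents."""
    return length_mb / genome_mb


combined = combined_length_mb(DYSTROPHIN_MB, CNTNAP2_MB)
ratio = ratio_to_genome(combined, E_COLI_GENOME_MB)
print(f"two genes: {combined:.1f} Mb; ratio to E. coli genome: {ratio:.2f}")
```

Under the assumed genome size the ratio comes out close to 1, which is all the comparison in the text requires.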

In 1964, Friedrich Vogel estimated that the human 
genome (~3.3 Gb) encodes around 7 million genes, based 
on the size of hemoglobins and the assumption that “most 
of the DNA works as genetic material”. Similar num- 
bers were obtained using similar logic by others: in 
1969 the theoretical biologist Stuart Kauffman extrapo- 
lated James Watson's estimate of 2,000 genes in E. coli 
in the first (1965) edition of his book “Molecular Biology 
of the Gene” to predict that there are 2 million genes in 
humans. 
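
The logic behind such estimates can be sketched as simple division and scaling. The example below is a hedged illustration, not a reconstruction of the authors' actual calculations: the coding length of a hemoglobin chain (146 amino acids, 3 nucleotides each) and the E. coli genome size (about 4.6 Mb) are assumptions not given in this passage, and all names are hypothetical. It shows only that both approaches yield numbers in the millions.

```python
HUMAN_GENOME_BP = 3.3e9          # ~3.3 Gb, as quoted in the text
E_COLI_GENES = 2_000             # Watson's estimate, as quoted in the text

# Assumptions for illustration (not stated in this passage):
HEMOGLOBIN_CODING_BP = 146 * 3   # a 146-residue chain at 3 nt per codon
E_COLI_GENOME_BP = 4.6e6         # ~4.6 Mb


def genes_if_all_dna_codes(genome_bp: float, gene_bp: float) -> float:
    """Gene count if the whole genome were tiled with genes of one size."""
    return genome_bp / gene_bp


def genes_by_genome_scaling(ref_genes: float, ref_bp: float, target_bp: float) -> float:
    """Gene count if gene number scaled in proportion to genome size."""
    return ref_genes * target_bp / ref_bp


vogel_style = genes_if_all_dna_codes(HUMAN_GENOME_BP, HEMOGLOBIN_CODING_BP)
kauffman_style = genes_by_genome_scaling(E_COLI_GENES, E_COLI_GENOME_BP, HUMAN_GENOME_BP)
print(f"hemoglobin-sized genes: {vogel_style:,.0f}")
print(f"scaled from E. coli:    {kauffman_style:,.0f}")
```

Both figures depend entirely on the assumed inputs; different choices of gene length or reference genome size shift the result, but not its order of magnitude.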

Vogel recognized that his estimate was disturbingly 
high and speculated that “the systems of higher order 
which are connected with structural genes in operons and 
regulate their activity might occupy a much larger part of 
the genetic material than the structural genes”. 

On the other hand, the existence of upward excep- 
tions, notably in some amoebae, plants and arthropods, 
to an otherwise consistent increase in the amount of DNA 
in cells of eukaryotes of differing developmental com- 
plexity (Figure 5.1) led to the so-called C-value enigma, 
which was invoked to support the growing idea that the 
genomes of higher organisms carry a lot of superfluous 
DNA (Chapter 7). 


REPETITIVE DNA 


In the late 1960s, Roy Britten and colleagues analyzed 
the patterns of DNA renaturation and concluded 
that, unlike in bacteria, “in general, more than one-third of 
the DNA of higher organisms is made up of sequences 
which recur anywhere from a thousand to a million times 
per cell” likely to have “arisen from large-scale precise 
duplication of selected sequences, with subsequent diver- 
gence caused by mutation and the translocation of seg- 
ments of certain member sequences” (Figure 5.2). 

* The smallest genome known is that of the parasitic microsporidian 
Encephalitozoon intestinalis (haploid 2.3 Mb); the largest known 
are the marbled lungfish Protopterus aethiopicus (ca. 130 Gb) and 
the monocot plant Paris japonica (ca. 150 Gb). 

They suggested that these sequences and events have 
important evolutionary implications: 


The range of frequency of repetition is very wide, 
and there are many degrees of precision of repeti- 
tion in the DNA of individual organisms. During 
evolution the repeated DNA sequences apparently 
change slowly and thus diverge from each other. 
There appears to be some mechanism which, from 
time to time, extensively reduplicates certain seg- 
ments of DNA, replenishing the redundancy. 


Britten and colleagues also cited many studies that 
observed changes in the pattern of types of hybridizable 
RNA synthesized during embryonic development and 
liver regeneration (noting that the hybridization condi- 
tions used selectively favored repetitive sequences), 
which showed “direct evidence for the genetic func- 
tion of at least some of the repeated DNA sequences” 
and “that during the course of differentiation different 
families of repeated sequences are expressed at differ- 
ent stages”. Many other studies have since confirmed 
that ‘repeat sequences’ are a prevalent component of the 
genomes of multicellular eukaryotes generally, and 
that they are transcribed in a highly regulated manner, 
especially in the early stages of embryonic development 
(Chapter 16). 

These results also explained the earlier observation 
that eukaryotic DNA often exhibited the presence of 
“satellite” bands in equilibrium density gradients, indi- 
cating a substantial fraction with different nucleotide 
composition. Heterogeneity of sequence composition 
was subsequently found to occur throughout the mam- 
malian genome, due to regional variation in G+C content, 
termed ‘isochores’ by Giorgio Bernardi. 

However, because they did not fit the conventional 
conception of a “gene”, the presence of large amounts of 
‘repetitive’ DNA, a term that became pejorative, rein- 
forced the idea that complex eukaryotes could tolerate a 
great deal of nonfunctional DNA, rather than stimulate 
new thinking, with the notable exception of Britten him- 
self and a few others (see below). 

FIGURE 5.1 Britten and Davidson’s 1969 graph of the increase in the minimum amount of DNA that had been recorded 
for “species at various grades of organization”. (Reproduced from Britten and Davidson with permission of the American 
Association for the Advancement of Science.) 

FIGURE 5.2 Britten and Kohne's 1968 graph of renaturation hybridization kinetics of denatured calf thymus DNA (circles and 
triangles) showing ~40% rapidly renaturing DNA, indicating the existence of high copy numbers (estimated to be an average of 
100,000) of repetitive sequences (left side of graph) and 50%-60% single copy sequences (right side). The curve with -- indicates 
the renaturation kinetics of radiolabeled E. coli DNA spiked into the same assay. (Reproduced from Britten and Kohne with 
permission of the American Association for the Advancement of Science.) 

FIGURE 5.3 Ears of corn kept by McClintock to illustrate the transposition of ‘controlling elements’: the ‘standard’ position of 
the transposable element Ds (Dissociation) located after C (colorless), Sh (shrunken) and wx (waxy) on chromosome 9 (upper ear, 
from 1947) and the transposed position of Ds now positioned before Sh, Bz (bronze) and wx (lower ear, from 1949). (Courtesy of 
Cold Spring Harbor Laboratory Archives (photo: Sarah Vermylen). Reproduced from Chomet and Martienssen with permission 
from Elsevier.) 

The repetitive sequences discovered by Britten 
turned out to be largely derived from transposons, 
short (SINE) and long (LINE) elements mobilized by 
reverse transcription, and DNA segments mobilized 
by a 'cut-and-paste' mechanism, which range in length 
from a few hundred to 30 thousand base pairs and can 
be excised or transcribed, copied and reinserted into 
genomes at other places, conveying cassettes of genetic 
information. 


CONTROLLING ELEMENTS 


Transposons were discovered in 1948 by Barbara 
McClintock,* who observed unexpected position- 
dependent effects of mobile DNA segments in maize, 
which changed the colors and color patterns of corn 
kernels (among other things), specifically the mobile 
‘Dissociation-Activator’ (Ds-Ac) elements, and another 
that she called ‘Suppressor-mutator’ (Spm), published 
well before Jacob and Monod's work on the regulation of 
the lac operon in E. coli, with which she later drew 
parallels (Figure 5.3). 

* McClintock also made many cytological observations of subcellu- 
lar structures, most notably the “nucleolar organizing region”, now 
known to contain tandem arrays of ribosomal RNA genes. 

McClintock's work was revolutionary. Rather than 
focusing on mutations that affect viability, she developed 
and used cytogenetic analysis of X-ray-induced chromo- 
somal breakage-fusion-bridge cycles for creating and 
mapping mutations. She was the first to show that 
genetic recombination involved physical exchange of 
chromosomal segments, the first to observe ‘jumping 
genes’ and the first to show that the position of their inser- 
tion alters the expression of nearby genes (for example at 
the bronze locus, which is involved in the biosynthetic 
pathway for the pigment anthocyanin), quite different 
from normal conceptions of regulatory sequences. 

McClintock reported her data on transposition in the 
formal scientific literature in 1950,? and in person in 
1951 at the influential Cold Spring Harbor Symposium on 
Quantitative Biology,? at the conclusion of which geneti- 
cist Evelyn Witkin recalls: 


In her own [McClintock’s] words, the response 
was puzzlement, and in some instances hostil- 
ity. Certainly, as I remember it, there was baffled 
silence after her talk and little or no discussion 
of her densely documented evidence and argu- 
ment for transposable elements and their effects 
on gene expression. The audience seemed embar- 
rassed by the lack of response, understandably, 
because McClintock was respected and admired 
as a great geneticist." Here, her conclusions were 
too radically in conflict with the entrenched 
genetic concept of a stable genome, and her data 
too complex, to allow for rapid or easy accep- 
tance, although a small number of geneticists who 
had come to know her work well believed it to be 
profoundly important.?? 


Highlighting the parallels with position-effect variega- 
tion in Drosophila and heterochromatin (and also citing 
the work of Ed Lewis, see below), McClintock said: 


The behavior of these new mutable loci in maize 
cannot be considered peculiar to this organism. 
The author believes that the mechanism underly- 
ing the phenomenon of variegation is basically the 
same in all organisms ... Is it usually an orderly 
mechanism, which is related to the control of the 
processes of differentiation??? 


McClintock's observations were confirmed by Royal 
Alexander Brink and Robert Nilan studying another 
mobile element called “Modulator”, who concluded that 
their findings “cannot be explained in conventional 
genetic terms" and speculated that Modulator “may 


* On more conventional grounds, McClintock was elected to the 
National Academy of Sciences in 1944 at the relatively young age 
of 42, and in 1945 she was elected the first female president of the 
Genetics Society of America." 
belong to the obscure category of germinal substances,
termed heterochromatin, which has been associated ... 
with different variegated phenotypes in Drosophila”.36 
The genetic behavior of these families of elements 
is complex but, in general, they are comprised of some 
small, transposition-competent sequences and a larger 
number of derived transposition-defective sequences? 
whose mobility is dependent on a factor supplied by 
an active member of the family.” Indeed, Ac and Ds 
are autonomous and non-autonomous transposable ele- 
ments,? respectively, explaining the dependence of 
the latter on the former. Other elements, such as the 
Suppressor-mutator element, are more complicated.* 
McClintock eventually received the Nobel Prize in 
1983 for having discovered transposition, coinciding 
with demonstrations by Nina Fedoroff and others that 
Ac encodes a transposase responsible for the transposition
of itself as well as Ds,??9 the identification of TEs
(transposable elements, see below) in Drosophila ^^-^* and 
bacteria,^^^* and the growing use of TEs as mutagenic 
tools, especially ‘P-elements’ in Drosophila. 
McClintock insisted that her most important finding 
was that such ‘controlling elements’ - which others such 
as Brink preferred to call ‘transposable elements’, as the 
term was “less interpretative’! — played a role in normal 
development, as both the timing and frequency of trans- 
position and associated chromosomal rearrangements are 
developmentally regulated and have developmental con- 
sequences.252652-5+ She also showed that some of these 
elements exhibited reversible mutations.>> 
McClintock conceptualized the signatures of epi- 
genetic control of gene expression decades before it was 
formally recognized and studied. She identified genetic 
elements that were either active or inactive only in the 
crowns of corn kernels, as well as elements that were 
active only in lower side branches (called ‘tillers’) or 
exhibited a different pattern of activity in cobs from the 
tillers and primary stalk.5%5+ This implied the existence 
of an additional regulatory mechanism that determines 
the pattern of TE expression/activity during develop- 
ment. McClintock correctly speculated that this mech- 
anism involved heritable modifications of the gene that 
are not caused by changes to its DNA sequence, later 
shown by Rob Martienssen and colleagues to be due to 
DNA methylation and histone modifications that regulate 


d The human genome, for example, has over 1 million copies of the 
Alu element (comprising over 10% of the genome), only ~5% of 
which are active. The name ‘Alu’ derives from the fact that these
elements were first recognized as short (~300 nt) repeated DNA
sequences that commonly contain a cleavage site for the Arthrobacter
luteus (AluI) restriction enzyme.?5
heterochromatin formation and are imposed by enzymes
directed by RNA?9-9? (Chapters 12, 14 and 16). 

McClintock intuited that her controlling elements did 
not encode regulatory proteins but rather functioned as 
regulatory modules. However, when transposons car- 
rying protein-coding antibiotic resistance genes were 
discovered in bacteria by James Shapiro in 1969," her 
theory that TEs are involved in regulating gene expression
during development was largely ignored, to her ongoing
frustration! The fact that the genetic phenomena
involved were complex and not easily understood, as
opposed to the simple lac operon/transcription factor
operator-repressor model, did not help.?

Nonetheless, she was the pioneer of the concept of the 
dynamic genome, including in response to stress,“ and 
her ideas were well ahead of her contemporaries. We now 
know that transposon-derived sequences are major sites 
of epigenetic modification, regulate many aspects of gene 
expression during plant and animal development, and are 
mobilized in evolutionary time for phenotypic diversifica- 
tion and in real time in the brain (Chapters 10, 16 and 17). 


PARAMUTATION, IMPRINTING 
AND TRANSINDUCTION 


In 1915, Bateson and Caroline Pellew reported unusual 
"rogue" non-Mendelian patterns of inheritance in peas, 929 
which were occasionally observed thereafter in other spe- 
cies, but were not studied in a systematic way until Brink 
coined the term ‘paramutation’ in the 1950s to describe 
the atypical inheritance of traits displayed by some genes 
in maize, tomato and other species.%-6 While initially 
thought to be restricted to plants, later studies showed that 
paramutation and related phenomena also occur in animals, 
including mammals, likely to a much greater extent than 
has been appreciated, and that it involves the intergenera- 
tional transmission of epigenetic information (Chapter 17). 

Waddington, who pioneered the concept that epigene- 
tic mechanisms control the trajectories of development,” 
reported, also in the 1950s, that exposure of Drosophila 
eggs to ether for a few generations resulted in the inheri- 
tance of a bithorax-like phenotype (see below) in sub- 
sequent (untreated) generations. He also showed that 
another phenotype, ‘crossveinless’ wings, induced by 
heat shock of pupae, showed similar inheritance in later 
generations! Waddington concluded that: 


All phenotypes are modified, to a greater or lesser 
extent, by the environment. All genotypes under 
natural conditions will be subject to selection 
pressures relating to the manner in which their 
development is modified by the environment. The 
phenotypic effect of any new gene mutation must 
therefore be to some extent influenced by the kind
of developmental flexibility which has been built 
into the rest of the genotype by selection for its 
response to the environment.” 


Such ‘genetic assimilation’ has profound implications 
(Chapter 18), but was waved away by most theoretical 
and molecular biologists of the time.” It is difficult, in 
any case, to disentangle the interplay of genetic variation 
with epigenetic inheritance,” and recent evidence sug- 
gests that, at least in some instances, the cause is stress- 
induced mutations.” 

In 1974, D. R. Johnson reported that an allele of the 
brachyury gene (which encodes a transcription factor 
required for mammalian gastrulation and early organo- 
genesis") has a lethal phenotype if inherited from the 
mother, but not if inherited from the father. A simi- 
lar phenomenon was described at a translocated locus 
affecting the relative number of male and female off- 
spring by Mary Lyon in 1977.? This 'parental imprint- 
ing’ was characterized independently by the groups of 
Azim Surani and Davor Solter in 1984 as a requirement 
for both female and male genomes for development, due 
to ‘marked’ loci (‘epialleles’) with differential patterns of 
expression depending on parent of origin.50-83 

Parental imprinting only occurs in mammals and is 
thought to be an adaptation to placental biology driven by 
sexual antagonism, maternal-offspring coadaptation and/ 
or kinship interest.?^-*?? A high proportion of the imprinted 
genes have arisen through retrotransposition®* (although 
they are depleted of SINE elements?) and imprinting 
appears to have evolved with the invasion of particular 
classes of repeats at the marsupial-eutherian interface.96?? 
Interestingly, non-coding RNAs are at the heart of the 
regulation of genomic imprinting and X-chromosome 
inactivation in mammals (Chapters 9, 13 and 16). 

Dysregulated imprinting has since been associ- 
ated with disorders such as Angelman and Prader-Willi 
Syndromes in humans,?-% as well as with the callipyge 
(‘beautiful buttocks’) ‘polar overdominance’ phenotype 
in sheep involving cis- and trans-epiallelic interac- 
tions%-% and non-coding RNAs? Similar epigenetic 
mechanisms that regulate transposon activity for genome 
defense and gene regulation were found to underlie 
the phenomena (discovered in the 1980s and 1990s) of 
‘repeat-induced’ and ‘homology-dependent’ gene silenc- 
ing? in fungi, plants and animals!%-1% (Chapter 12). 

Another odd phenomenon, termed ‘transinduction’,
was described in 1997 by Alison Ashe, Nick
Proudfoot and colleagues, who found that transient cell 


* These phenomena have also been referred to as ‘quelling’ and 
'co-suppression'. 
FIGURE 5.4 (a) Wild-type and (b) a bithorax mutant Drosophila melanogaster. (Reproduced from Akbari et al.!% with permission
from Elsevier.)


transfections with a beta-globin gene construct induced 
the expression of non-coding RNAs from the regula- 
tory ‘locus control region’ and intergenic regions of 
the globin locus (but not the endogenous beta-globin 
mRNA), which required transcription of the transgene 
but not its translation.!% That is, ectopic expression 
of an mRNA has regional effects on the gene expres- 
sion profile of its locus in the absence of the encoded 
protein, which indicates that transcription itself and/or 
mRNAs also convey regulatory signals. 


THE BITHORAX ‘COMPLEX LOCUS’ 


Clues pointing to epigenetic control of development and 
the involvement of regulatory RNAs were also emerg- 
ing in the 1940s and 1950s from Ed Lewis’ studies on 
the genetics of Drosophila body segment specification, 
based on mutant and recombination phenotypes, particu- 
larly in ‘homeotic’ genes, which produce transformations 
of segment identity.!07-109 

‘Homeosis’ was initially described and the term
coined in 1894 by Bateson to describe phenotypic variations
in which “something is changed into the likeness of
something else”.!!%!!! The first homeotic gene was identified
in 1915 by Calvin Bridges,” who identified a mutation
(‘bithorax’) that converted the third thoracic segment
into the second, producing an additional pair of wings,
and mapped it to a genomic region later named by Lewis
the ‘bithorax complex’ (Figure 5.4).113-119

Mutations in the bithorax complex showed spec- 
tacular perversions of development. The bithorax phe- 
notype is caused by loss-of-function mutations in the 
protein-coding gene ultrabithorax (Ubx). Other muta- 
tions convert antennae into legs (Antennapedia, Antp) or 
genitalia into legs or antennae (Abdominal-A, abd-A, and 
Abdominal-B, Abd-B), all also caused by loss-of-function 
of ‘homeotic’ proteins.!?-!? It was subsequently realized,
after the gene cloning revolution (Chapter 6), that
duplicated paralogous clusters of homeotic genes, with 
the same spatial order of paralogs, occur in vertebrates.!?! 

Lewis found that the spatial expression patterns of 
the homeotic genes and their phenotypes are modu- 
lated by mutations in nearby ‘cis’-regulatory loci called 
anterobithorax (abx), bithorax (bx), bithoraxoid (bxd), 
contrabithorax (Cbx) and postbithorax (pbx), as well as 
infra-abdominal (iab1-4, modulating abd-A; and iab5-8,
modulating Abd-B), which he termed “pseudoalleles”.122


f The evolutionary significance of this transformation was not lost 
on Lewis, who noted the similarity of bithorax mutations to (and 
therefore the ease of generating) the additional set of wings on a 
dragonfly.!? 
Some of these loci turned out to reside in the introns of
the protein-coding genes, but most could be easily sepa- 
rated by recombination, indicating that they are discrete, 
separate genetic elements.!? 

These genes also exhibited intriguing position 
effects, trans-heterozygotes being more abnormal than 
cis-heterozygotes, indicating a local interaction, with 
trans-heterozygotes having defective regulation on one 
chromosome and a defective protein on the other, whereas 
cis-heterozygotes had a normal pattern of expression of a 
functional protein from one chromosome.!??.?4 


TRANSVECTION 


In 1954, Lewis discovered that a structural rearrange- 
ment that moved Ubx to a different chromosomal location 
resulted in a mutant phenotype in flies heterozygous but 
not those homozygous for the rearrangement, which he 
attributed to disruption of the pairing of the two alleles.125

Based on the ‘cis-trans position effect’, Lewis proposed
that pseudoallelic regions control the hierarchical
expression relations (or ‘polarity’) observed in the cluster
by either “cis-vection”, in which the substances generated
by the pseudoalleles acted on adjacent genes, or by “transvection”,
based on the proximity (somatic pairing) of
homologous sequences, to explain the influence of alleles
in one chromosome over the alleles in another.'??.?4

Transvection, or ‘allelic cross-talk’, has been observed 
at many other loci in Drosophila and, among other things, 
regulates the sexually dimorphic expression of X-linked 
genes.?6-?5 The phenomenon appears to be proximity- 
dependent, but not always.??/?? Some interactions can 
occur at a distance and may not be strictly dependent on 
homolog pairing as translocations and mutations in the 
gene zeste, which normally disrupt transvection, have 
limited or no effects on pairing."! 

To explain transvection, Lewis initially invoked the 
operon model, postulating that pseudoallelic loci contained 
genes responsible for the synthesis of substances that are 
coordinately regulated by an operator element!%1% and that 
a locally produced diffusible substance would activate a 
cascade of reaction steps involving the other substances (the 
“sequential reaction model").??.?^ However, if these sub- 
stances were proteins their production had to occur close to 
their sites of action because of the proximity effect, i.e., in 
the nucleus, whereas translation occurred in the cytoplasm. 

Consequently, like transposition, it was problematic 
to explain the ‘pairing-dependent’ transvection phenom- 
enon in conventional terms, unless “homologous chro- 
mosomes cooperate with one another in the transcription 
process”.10%8 Lewis then floated the idea that substances
“used only briefly in development” are synthesized “at
the chromosomal level”.!° 

While the obvious candidate was RNA, Lewis sug- 
gested that the substances involved might be RNAs, 
polypeptides, or products of the enzymatic activity of 
such polypeptides, and was careful not to make 
assumptions,? famously saying that “The laws of genetics 
had never depended upon knowing what the genes were 
chemically and would hold true even if they were made of 
green cheese", perhaps a frustrated reaction and pointed 
riposte to colleagues who told him that he “was simply 
dealing with missense and nonsense mutants within a 
protein and that all we were doing was mapping sites 
within a single protein coding unit!”.132,134 

Decades later, in the 1980s, David Hogness (who cloned 
the first eukaryotic transposon), Mike Akam and colleagues
found that the bithorax complex produces multiple overlap- 
ping and intergenic RNAs, many of which do not encode 
proteins, 5-73 including those from the (presumed) cis-reg- 
ulatory sequences encompassing Polycomb and Trithorax 
‘response elements'?€/9 (see below). For example, the 
bxd locus produces a 27 kb transcript whose expression is 
highly regulated during embryogenesis, in a pattern that is 
partially reflexive of the expression pattern of Ubx.!35-138 
The regulatory elements of abd-A and Abd-B are also 
transcribed.!+0-142 

With this evidence in mind, others also suggested that 
transvection involves RNA transactions,!?-!^ concluding 
in one case that “the genetic analysis of these transvec- 
tion effects suggests that the transcription of the CbxIRM 
and Cbx2 alleles depends on RNAs of short radius of 
action from the homologous Ubx gene"? Subsequent 
studies showed that transvection occurs by sequences on 
one homolog directing the expression of the cognate tran- 
scription unit on the other, ^6 and that these sequences are 
‘enhancers’,!47-!>° genes that control the spatial expression 
patterns of nearby and distal protein-coding genes during 
development and produce non-coding RNAs in the cells 
in which they are active (Chapters 14 and 16). 

Like other genes and genetic phenomena affect- 
ing development that were first discovered in maize or 
Drosophila and thought to be idiosyncratic, transvection 
was subsequently found to occur in fungi, plants and 
mammals, and to be a general feature of multicellular 
eukaryotes, although still poorly understood.!*!.132 


* Lewis later capitulated, but not completely, to the conventional view 
of gene regulation, stating in 1994 that: "The remaining regions are 
thought to include enhancer-like sequences to which regulatory pro- 
teins bind, thereby conferring spatial- and temporal specific produc- 
tion of the proteins encoded by the complexes."!!! 
EPIGENETIC MODIFIERS


The other major finding of this period, whose impor- 
tance was also not broadly appreciated until much later, 
was the identification of genes that had a global effect 
on the expression of homeotic genes, termed Polycomb
group (PcG) and Trithorax group (TrxG). The first PcG
gene, Polycomb (Pc), was discovered in 1947 by Pamela
Lewis,^? so named because the most common effect 
of PcG mutants is the appearance of extra pairs of sex 
combs on the legs of the second and third thoracic seg- 
ment in males, a feature normally found only on the legs 
of the first thoracic segment. These mutants are lethal 
when homozygous but cause mild homeotic transforma- 
tions when one copy is incapacitated, a semi-dominant 
‘haplo-insufficient’ phenotype. 

Related genes with similar effects, such as Extra sex 
combs, Sex combs on midleg and Additional sex combs, 
were subsequently also identified.!**-156 In 1978, Lewis 
observed that Pc "seems to code for a repressor of the 
complex", as it repressed expression of Ubx in anterior
segments, with its loss causing them to be transformed into more
posterior ones.!?

Mutants that had the opposite effect were also identi- 
fied, notably by Phil Ingham in the 1980s. These mutants, 
including Trithorax and others such as Enhancer of 
zeste, cause embryonic segments to be transformed into 
anterior ones by antagonizing PcG proteins.P?-157-16! 
Many TrxG genes were subsequently identified through 
genetic screens for mutations that suppress the pheno- 
type of PcG genes.!9?-164 

The observation that PcG and TrxG factors are 
required to maintain homeotic gene expression gave rise 
to the hypothesis that PcG (repressive) and TrxG (activating)
proteins act as a “cellular memory system”.165,166
Orthologs of PcG and TrxG genes were later found to 
be ubiquitous in plants and animals!” and to encode 
histone-modifying proteins and ATP-dependent chro- 
matin-remodeling complexes, or repressors thereof, for 
the epigenetic control of gene expression during devel- 
opment (Chapter 14). 

Indeed, if all this sounds complicated, it is, and 
was too much for the operon crowd to digest. Max 
Delbrück, the physicist turned bacteriophage geneticist,
wrote in a 1963 letter to a former member of
Lewis’ laboratory: “I then plunged into the bithorax 
saga for which Lewis very kindly sent me his latest 
manuscript ... I must say I am puzzled ... [and] ... 
strongly suspect that there is something wrong here 
in the analysis.”!*? 
THE BRITTEN AND DAVIDSON MODEL


In 1969, Britten and Eric Davidson" published a paper 
entitled ‘Gene regulation for higher cells: a theory’,
which attempted to integrate the findings of the preceding 
decades, including the prevalence of repetitive sequences 
in higher organisms, Davidson’s observations on 'infor- 
mational RNAs’ in amphibian and sea urchin embryos,!”° 
and the data on embryo development that Davidson had 
assembled the year before in his book ‘Gene activity in
early development’.!”! This paper devoted special attention
to the enormous size of the genomes of higher organisms, 
the diversity of transcripts in the nucleus and the abundance 
of repetitive sequences that are transcribed in a cell-specific 
fashion. It was the first serious consideration of gene net- 
works (“gene batteries") and regulatory circuits in the evo- 
lution and development of higher eukaryotes.!”? 

Britten and Davidson's theory was partly but not 
entirely influenced by Jacob and Monod's principles (with 
elements analogous to operators, regulator and structural 
genes), but was distinct in that it encompassed the differ- 
ence in genomic organization and transcriptional output 
between eukaryotes and prokaryotes, and emphasized 
the importance of genomic structure in the cascade of 
gene expression during development. An important 
proposition was that “the (normal) state of the higher cell 
genome is histone-mediated repression and that regula- 
tion is accomplished by specific activation"? 

Britten and Davidson proposed that genes in eukaryotes 
were differentiated into functional classes: structural (or
“producer”) genes (which encode proteins) and “integrator
genes” (which encode regulatory molecules), the latter of
which was influenced by McClintock's model of gene reg- 
ulation, developed before it was known that transposable 
elements were the main sources of repeated sequences. 

As in Jacob and Monod's original model, Britten and 
Davidson posited that regulatory genes produce RNAs. 
They extended this notion to suggest that such RNAs would 
connect regulatory networks to activate the gene batter- 
ies that specified the phenotypes of different cell types. 
Focusing on regulation at the level of transcription, the 
proposed advantage of RNAs as regulatory molecules was
again their base-pairing ability, which allowed genomic regulation
by recognition of specific “receptor sequences”.?
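
The logic of recognition by base pairing can be made concrete with a small sketch. The sequences and gene names below are invented for illustration and do not come from Britten and Davidson's paper; the sketch also assumes perfect complementarity over the whole site, which is a simplification. It shows how a single regulatory RNA could address every producer gene that carries the same receptor sequence, which is the essence of a gene battery.

```python
# Toy illustration only: sequences and gene names are hypothetical.

# Watson-Crick pairing of a DNA base with the RNA base opposite it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def dna_to_rna_complement(dna_site: str) -> str:
    """Return the RNA (5'->3') that pairs antiparallel with a DNA site (5'->3')."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(dna_site))


def recognizes(activator_rna: str, receptor_dna: str) -> bool:
    """True if the RNA is perfectly complementary to the receptor sequence."""
    return activator_rna == dna_to_rna_complement(receptor_dna)


def battery_for(activator_rna: str, receptors: dict) -> list:
    """Names of all producer genes whose receptor sequence the RNA can pair with."""
    return sorted(
        gene for gene, site in receptors.items() if recognizes(activator_rna, site)
    )


# Two hypothetical producer genes share a receptor sequence; a third does not.
receptors = {"geneA": "GATTACA", "geneB": "GATTACA", "geneC": "CCGGTTA"}
activator = dna_to_rna_complement("GATTACA")
battery = battery_for(activator, receptors)
```

In this toy case one activator RNA selects geneA and geneB together, while geneC, with a different receptor sequence, is untouched. Changing a receptor sequence rewires the battery without altering any protein-coding sequence.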


^ Davidson was a colorful character. He played American football 
at a senior level, rode a Harley-Davidson, and was lead singer for 
an Appalachian folk music ensemble and played banjo in the Iron 
Mountain String Band.!%1% He had worked with Mirsky and he and 
Roy Britten were part of the Caltech group that included Delbrück
and Leroy Hood (who later invented the first automated DNA 
sequencer, Chapter 10), one of a number of overlapping schools of 
thought that influenced research and concepts throughout the early 
period of molecular biology. McClintock was an outsider. 
A major factor in their proposal was the literature
describing chromatin-associated RNAs and the huge het- 
erogeneous nuclear RNAs that contain repetitive sequences. 
According to the model, RNAs bind in a sequence-specific 
manner to DNA to provide the basis of specific gene regu- 
lation, explaining the complexity of RNAs in the nucleus 
compared to the cytoplasm. It would also explain the differ- 
ent spectrum of RNAs transcribed from repeated sequences 
during early development and cell differentiation.70173 

Britten and Davidson thought that, while it was possible
that either RNA or proteins were the diffusible regulatory
molecules, RNA represented the “simpler alternative”.^ In
addition, according to them, the potential for forming
new gene batteries (acting in a specific cellular program)
appeared to differ greatly between the two alternatives,
because RNA specificity would depend only on target
sequence complementarity, with implications for the modularity
and co-evolution of these sequences (Chapter 16).

They were also motivated by early measurements of the 
cellular DNA content in different species, which showed 
that the amount of DNA per cell was greater in "higher" 
than in “primitive” forms of invertebrates."^ Britten and 
Davidson plotted the amount of DNA of a vast range 
of organisms (from viruses and bacteria to mammals) 
and were able to show that the minimum genome size 
greatly increased concomitant with increased complex- 
ity of their organization. They then reasoned that most 
of the known biosynthetic pathways were already present
in unicellular organisms, and that a significant increase 
in the number of ‘structural genes’ in higher organisms 
was unlikely.^ Therefore, they posited that the differ- 
ence between sponges and mammals would lie in the 
increased complexity of regulation, which corresponded 
to the “integrator genes” and “receptor sequences”. 
The expansion of non-protein-coding sequences, but not 
coding sequences, in animal genomes was later con- 
firmed by genome sequence data (Chapter 10). 5-177 

Finally, Britten and Davidson proposed that the repeti- 
tive sequences played a role in coordinating the expression 
of gene networks during development and differentiation. 
In fact, Britten and David Kohne in their original paper 
published in the previous year, perhaps aware of the growing
consensus that repetitive and other non-protein-coding
sequences in the huge genomes of higher organisms are
junk (Chapter 7), stated: “A concept that is repugnant to us
is that about half of the DNA of higher organisms is triv- 
ial or permanently inert (on an evolutionary time scale). 
Furthermore, at least some of the members of DNA fami- 
lies find expression as RNA. We therefore believe that the 
organization of DNA into families of related sequences 
will ultimately be found important to the phenotype.”” 
In their model, the repeat elements comprise "receptor 
genes" adjacent to “producer genes", functioning as target 
sequences for regulatory RNAs (Figure 5.5). 
These ideas were extended in subsequent publications
during the 1970s, including proposed molecular
mechanisms for gene-by-gene rewiring in development 
and evolution.!”*-18 They were also supported by Nina 
Fedoroff's observation that "large RNAs in the mam- 
malian cell nucleus contain elements complementary 
to ones in other large nuclear RNAs",*? although it was 
later noted by Pederson that “this intriguing finding was 
never pursued by this or any other group, at least as the 
extant literature speaks" .!* 

Importantly, Britten and Davidson developed the 
idea that multicellular development was dependent on 
an extensive, interwoven regulatory architecture embed- 
ded in the structural organization of the genome, and 
that developmental 'novelty' — phenotypic diversifica- 
tion — is achieved mainly by variation in the regulatory 
architecture, "?.155-157 not the repertoire of encoded pro- 
teins (although there are lineage-specific expansions of 
protein families, such as homeotic genes in vertebrates 
and olfactory receptors in mice, and some novel proteins 
that appear from time to time in evolution). The concept 
that phenotypic diversity in plants and animals is largely 
achieved through mutations altering regulatory circuits! 
was extended by Mary-Claire King and Allan Wilson 
in their 1975 paper ‘Evolution at two levels in humans
and chimpanzees’5* and Jacob's 1977 paper ‘Evolution
and tinkering’,!®° and is now well accepted in the field
referred to as ‘Evo-Devo’,?9?! even extending to the
evolution of enzyme activity.'?? 

Waddington commented that the Britten-Davidson 
model of the mechanisms controlling embryogenesis was 
“the first ... to make sense".?? 


BOOLEAN MODELS OF 
COMBINATORIAL CONTROL 


In parallel with Britten and Davidson, a Boolean net- 
work function for the regulation of gene activity was 
advanced by Kauffman.!*%1%-1% Kauffman invoked 
bacterial operon promoter-repressor systems as the 
model and suggested that combinations of binary 
(‘on-off’) DNA-protein interactions could produce 
stable gene activity patterns from small numbers of 
variables, later embraced by Davidson.!'95127-19% Such 
approaches (which do not consider, but do not prohibit,
regulatory RNAs) have been used to correctly
predict some gene expression profiles;?520020! they 
also gave comfort to the generally accepted idea that 


! A feature especially amenable to transposon insertion and associated 
epigenetic control. 
combinatorics of protein interactions are sufficient to
account for ontogeny, although they have otherwise had little
impact on the consciousness of the molecular biologists
traditionally concerned with nuts and bolts.
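
How a handful of on-off interactions can settle into stable activity patterns is easiest to see in a toy case. The sketch below is not Kauffman's model, which dealt with large randomly wired networks; it is a hypothetical three-gene circuit with invented rules (A and B repress each other, and C is on only when A is on and B is off), updated synchronously. Enumerating all eight starting states shows which patterns the circuit falls into.

```python
from itertools import product

# Hypothetical three-gene circuit; genes and rules are invented for illustration.


def step(state):
    """Synchronously update genes (A, B, C) from the current on/off state."""
    a, b, c = state
    return (
        int(not b),        # A is on unless repressed by B
        int(not a),        # B is on unless repressed by A
        int(a and not b),  # C needs A on and B off
    )


def attractor(state):
    """Follow a state until it repeats; return the cycle in a canonical order."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]
    k = cycle.index(min(cycle))  # rotate so the smallest state comes first
    return tuple(cycle[k:] + cycle[:k])


def all_attractors():
    """Attractors reached from every possible starting state."""
    return {attractor(s) for s in product((0, 1), repeat=3)}


attractors = all_attractors()
stable_patterns = sorted(cycle[0] for cycle in attractors if len(cycle) == 1)
```

Under these assumed rules the eight possible states collapse onto two stable patterns, (0, 1, 0) and (1, 0, 1), plus one two-state oscillation between (0, 0, 0) and (1, 1, 0). The two fixed points behave like alternative cell states maintained by the circuit itself.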

In any case, by the beginning of the 1970s, two 
major problems identified in the 1960s were becom- 
ing generally recognized: for what purpose is so much 
of the DNA in a cell transcribed, and why does only 
a proportion of nuclear hnRNA contain the poly(A) 
sequence typical of mRNAs? A commentary in the 
journal Nature at the time remarked that "all the atten- 
tion has focused on proving the existence of the mes- 
senger proportion of the HnRNAs", and then predicted 
that “In the future it may be the other sequences, among
which controlling elements might be found, that will
command more interest”.202


PROCESSED RNAs AS GLOBAL REGULATORS 


Explicitly influenced by the ideas of Britten and 
Davidson, as well as of Vogel,'? Crick (later a proponent 
of selfish and junk DNA, Chapter 7) ventured in 1971 
that most DNA of ‘higher organisms’ does not encode 
proteins and proposed a general model for the chromo- 
somes of higher organisms.?? Crick's model was based 
primarily on cytogenetic studies of Drosophila poly- 
tene chromosomes, in which dense bands contained 
high concentrations of DNA and histones separated by 
less dense 'interband' areas (see Chapter 14 for modern 
views on chromatin and genome organization). He sug- 
gested that the protein-coding sequences were found in 
the small fraction of fibrous DNA characterizing the 
interbands. On the other hand, the dense chromosome 
bands corresponded to “globular structures”, which
comprised most of the genome and which, he proposed,
would contain “unpaired DNA” available for gene con-
trol, the “Unpairing Postulate”. Crick speculated that
“this postulate has even more force if single-stranded 
RNA is also used to recognize the control sequences 
on DNA elements”, and that repetitive sequences may 
be the specific interaction sites regulated by interactions 
with histones.205 

Based on the properties of RNA, Gerald Kolodny 
proposed in the early 1970s that regulatory RNAs 
originating from the breakdown of hnRNAs consti- 
tuted major drivers of cell differentiation during devel- 
opment. Kolodny hypothesized that the derived “short 
activator RNAs” could base pair with unique single- 
stranded DNA sequences in control regions of the 
target genes (such as in areas of the chromatin where 
the DNA is exposed and able to be experimentally 
cleaved by DNasesⁱ) and promote regulation at the
transcriptional level.204,205 According to this model,
hnRNA processing could occur either in the nucleus 
or in the cytoplasm, as there was evidence from Lester 
Goldstein and colleagues that short RNAs are shut- 
tled back to the nucleus, including RNAs that then 
would become associated with chromatin;206209 the 
short RNAs would act as “primers” for transcription 
of one or several mRNA species containing matching 
sequences.?"4 

In addition, Kolodny and colleagues showed that 
RNAs are secreted by mammalian cells, confirming 
Pierre Mandel and Pierre Métais’ 1948 report of RNAs 
circulating in plasma and transmitted between cells.210-212 
In Kolodny’s model, activator RNAs are derived from 
stored maternal RNAs (thus having a role in inheritance) 
and initiate an unfolding developmental program during 
embryogenesis.?04205,211,212 

Burke Judd and Michael Young suggested in 1974 that 
individual chromosomal subunits (or “chromomeres”) cor- 
responded to cistrons and were modules of gene regulation 
in eukaryotes.?? This idea was based on evidence that large 
proportions of chromomeres are transcribed into very large 
hnRNAs? and likely processed into one or several mRNAs. 
Judd and Young hypothesized that these segments contained 
much more information than protein-coding sequences, and 
yet acted as a single operational unit. This theory of “one
cistron – one chromomere” considered that the large propor-
tion of DNA that was transcribed, but apparently not trans-
lated, might be “coding for regulatory functions”, conveying
information for regulating transcription and post-transcrip- 
tional maturation and translation of that unit. Finally, they 
proposed that some of the “extra RNA” released during the 
processing of the large transcripts could activate other cis- 
trons of a (related) biosynthetic or developmental pathway, 
in which mutations might manifest in a pleiotropic (i.e., mul- 
tilateral) fashion.?? 

In 1975, Stuart Heywood and colleagues proposed 
mRNA regulation by “translational-control RNA”, short 
RNAs hypothesized to be generated by processing of 
hnRNAs in the nucleus,?/?2/6 conceptually presaging the 
biogenesis and action of microRNAs discovered 30 years 
later (Chapter 12). 

In 1976, George Brawerman developed an RNA- 
based model for the control of transcription during mul- 
ticellular development “distinct from those operating in 
bacteria", mediated by "primer RNAs" that could bind 


i ‘DNase hypersensitivity sites’ occur at different chromosomal 
positions in different cell types and were later exploited to map the 
positions of transcription initiation and transcription factor binding 
sites in genomes (Chapters 11 and 14). 
by complementary base pairing to target sites in the
genome, including repetitive sequences, and be abnor- 
mally expressed in cancer. He hedged his bets however, 
noting that recent studies had also indicated “that unique 
proteins associated with the DNA in chromatin appear to 
be directly responsible for the specificity of transcription 
... [which does] not seem to leave any room for a specific 
role of primer RNA”.217 

In the same year, Elizabeth Dickson and Hugh 
Robertson published an article entitled “Potential regu- 
latory roles for RNA in cellular development ?!521? 
They proposed that RNAs may represent “informed sig- 
nals” that regulate gene expression by interacting either 
“directly or indirectly" with DNA.?1$ They used examples 
of emerging cellular phenomena involving RNAs (such 
as the existence of ‘non-coding’ infectious RNA mole- 
cules, called ‘viroids’, in plants; Chapter 8) to highlight 
the biological capacities of RNAs, and canvassed the 
features that made RNAs “prime candidates" as regula- 
tory molecules: rapid turnover by nuclease degradation; 
the ability to fold in complex ways, with globular regions 
"reminiscent of protein structure" that are ideal for inter- 
action with proteins; base pairing specificity for interac- 
tion with nucleic acids; and, finally, high (informational) 
"coding" capacity. 

They highlighted that an RNA sequence of just 
17 bases is sufficient to specify a unique region in the 
human genome, compared to the information required 
to produce a regulatory protein such as the lac repressor
(~1,000 nucleotides). Thus, they suggested, besides the 
protein regulatory systems, “the use of RNA as an addi- 
tional control element would add flexibility, efficiency, 
and elegance to a logical system of gene control"?! 
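
The figure of 17 bases can be checked with a back-of-envelope calculation. The sketch below is illustrative only: it assumes a haploid human genome of roughly 3.1 billion base pairs (a value not given in the text), treats the genome as a random sequence, and counts both strands. The function name is hypothetical.

```python
def min_unique_length(genome_bp: int, both_strands: bool = True) -> int:
    """Smallest k for which the number of possible k-mers (4**k)
    exceeds the number of positions at which a k-mer could start.

    Assumes a random sequence; repeats in real genomes break this.
    """
    positions = genome_bp * (2 if both_strands else 1)
    k = 1
    while 4 ** k <= positions:
        k += 1
    return k


# Assumed approximate haploid genome size, not taken from the text.
HUMAN_GENOME_BP = 3_100_000_000

print(min_unique_length(HUMAN_GENOME_BP))                      # both strands
print(min_unique_length(HUMAN_GENOME_BP, both_strands=False))  # one strand
print(4 ** 17)                                                 # possible 17-mers
```

Under these assumptions the threshold is 17 bases when both strands are counted and 16 for a single strand. Because real genomes are rich in repeats, this is a lower bound on the length needed rather than a guarantee of uniqueness.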

Dickson and Robertson speculated about the possible 
sources of regulatory RNAs and their mechanisms of 
action, including the regulation of transcription, translation, 
target degradation and even DNA replication. In short, like 
previous suggestions, they indicated that hnRNAs could 
be processed in the nucleus in a cell-type specific manner 
into mRNAs and "extra RNAs", generating a "stable RNA 
molecule from a previously unstable RNA (precursor) 
species" that could act, for example, as trans-acting tran- 
scriptional primers that coordinately promoted expression 
of specific genes. Finally, they also proposed that RNAs 
could be utilized as signals to transfer external information 
to the genome during cellular development, resulting in a 
change in the state of differentiation.?!* 

This model not only explored the potential of RNA 
regulators, but integrated it with the broader spectrum 
of regulatory options in eukaryotic cells, involving both 
proteins and RNAs. It was remarkably prescient. 
OUT ON A LIMB


Receptive reactions, however, were not common, and 
most contemporaries preferred the perceived simplic- 
ity of a protein-based regulatory schema, believing that 
RNA regulation was unnecessary, and that such models 
were flights of fancy. 

As Ellen Rothenberg, who worked with Davidson, 
wrote in her obituary of him:??? 


Now, by the 1970s, most regulatory biologists in 
my own molecular biology orbit (at Harvard, MIT, 
University of California San Francisco, and the Salk 
Institute) had been massively influenced by Jacob 
and Monod's work, by models of bacterial operon 
regulation, and by the precedents for elegant λ phage
regulation of lytic vs. lysogenic growth by a mini 
network of mutually antagonistic activator/repressor 
proteins.?21-226 How could these be skimmed over so 
lightly in a book about differential gene regulation as 
the foundation for development? 

It was not just this particular work of Eric's that 
failed to draw upon Jacob and Monod. Interestingly, one 
of the most controversial predictions in the 1969 Britten 
and Davidson paper was that regulatory RNAs rather 
than regulatory proteins might be responsible for com- 
plex gene regulation.? Yet this was presented without 
regard for the clear evidence already in hand at the time 
that gene regulatory molecules were proteins in these 
bacterial systems. Why? Asked about this many years 
later, Eric often explained that for him in the 1960's, 
the evident differences between bacterial gene regula- 
tion and complex eukaryotic gene regulation in develop- 
ment completely dwarfed the similarities. Hybridization 
kinetic analyses of bacterial and multicellular eukaryotic 
genomes had already showed these to have vastly differ- 
ent kinds of sequence organization, with a severe paucity 
of repeat sequences in the bacterial genomes compared 
to the multicellular eukaryotes. If these were regulatory 
sites, then bacteria were missing this kind of regulation. 

Also, Eric's view of development was that this irreversible, 
hierarchical process of increasing complexity that he was 
interested in was so different from the reversible, physiologi- 
cal nutrient responses of bacteria that there was no reason to 
posit the same kinds of molecular mechanisms. In this way, 
Eric and Roy were indeed charting their own course. But 
were they actually solving developmental mechanisms??? 


It seems that many felt the same way, and the thought- 
ful models of Britten and Davidson, and Dickson and 
Robertson, and their ideas of regulatory RNA, although 
they attracted some attention at the time, were ultimately 
sidelined and overrun in the excitement of the emerging 
gene cloning revolution. 
Brow-beaten by the orthodoxy, Davidson spent the
rest of his highly successful and influential career study- 
ing the role of transcription factors and gene regulatory 
networks in sea urchin development!”?,185-187,197,198,220,228-233 
(Chapter 15). Davidson did however report that “non- 
translatable transcripts containing interspersed repeti- 
tive sequence elements constitute a major fraction of the 
poly(A) RNA stored in the cytoplasm of both the sea 
urchin egg and the amphibian oocyte”,2% one of the first 
explicit descriptions of long ‘non-coding’ RNAs beyond 
those in the ribosome (Chapter 9). 

Roy Britten remained studying repetitive DNA mainly 
from an evolutionary perspective but not specifically as 
regulatory cassettes, although he did acknowledge their 
importance as sources of variation,2%2% a theme that 
would be picked up later by others. 
6 The Age of Aquarius


While the nature of gene regulation in higher organisms 
was being mooted, hamstrung by lack of molecular data, 
elsewhere a technological revolution was taking place, 
which would provide the toolkit to determine the rep- 
ertoire of genes, their structures and products, and ulti- 
mately the composition of entire genomes. 

The pace of molecular biology research and discovery 
was accelerating because of the expansion of the univer- 
sity and research sectors after World War II, especially in 
the 1970s when a new ‘baby boomer’ generation entered 
the workforce, eager to embrace new ways. The focus 
on bacterial molecular biology dissipated and there was 
a “mass migration in biomedicine” into “higher organ- 
isms” with the advent of gene cloning and the use of ani- 
mal viruses to understand cancer.! 


RECOMBINANT DNA AND ‘GENE CLONING’ 


In 1972, Paul Berg, Stanley Cohen and Herbert Boyerᵃ
ushered in the biotechnology era by demonstrating that
DNA could be cut and joined in vitro to make a ‘recom-
binant DNA’ molecule that could be propagated in a
host cell to generate large numbers of copies by clonal 
amplification. 

The roots of this advance lay in bacterial genetics, 
specifically in another strange phenomenon called ‘host
restriction-modification’, discovered in the early 1950s
by Salvador Luria, Mary Human, Giuseppe Bertani and 
Jean Weigle, whereby a bacteriophage infects a different 
strain of a bacterium (initially of E. coli and the closely 
related Shigella dysenteriae) orders of magnitude less 
efficiently than it infects the host strain from which it was 
derived, and vice versa.?-^

A decade later, Werner Arber and Daisy Dussoix, 
and Matthew Meselson and colleagues, showed that
this odd behavior was a manifestation of a bacterial 
defense system comprised of an enzyme (a ‘restriction 
endonuclease’) that cleaves foreign DNA at or near a 
specific nucleotide sequence, and a complementary 
enzyme that insulates the same sequences in the host 
genome, typically by methylation of one of the nucleo- 
tides. Most copies of invading viral genomes are cut by 


a Berg, Cohen and Boyer were all part of the south San Francisco 
UCSF/Stanford community, the epicenter of the early development 
of recombinant DNA technology. 
the nuclease, but a few become protected by the modi-
fication and escape cleavage. These survivors are then 
able to replicate efficiently, be insulated and infect the 
same bacterial strain with high efficiency, but not the 
original strain or others with different host restriction- 
modification systems.?-? 

Different types of restriction endonucleases were 
defined:^ Type I enzymes cleave DNA at a random dis- 
tance, up to one kilobase, away from a complex recogni-
tion site; Type II enzymes, of which there are several
subtypes, recognize and cleave palindromic recognition 
sites (such as GGATCC), usually 4–8 base pairs in length,
to produce either blunt or overhanging (complementary
or ‘sticky’) ends by staggered cleavage; Type III enzymes
recognize two separate non-palindromic sequences that
are inversely oriented, and cleave 20–30 base pairs from
the recognition site; and Type IV enzymes recognize 
and cleave specific modified, usually methylated, DNA 
sequences.* 
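
The notion of a palindromic recognition site, and of cutting DNA at every occurrence of it, can be made concrete with a small sketch. This is a toy model under stated assumptions, not a description of any enzyme's biochemistry: the function names are hypothetical, the test sequence is invented, and the cut offset of one base into the site is an illustrative choice.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(site: str) -> bool:
    """A site is palindromic when both strands read the same 5' to 3'."""
    return site.upper() == reverse_complement(site)


def find_sites(dna: str, site: str) -> list[int]:
    """Zero-based start positions of every occurrence of the site."""
    dna, site = dna.upper(), site.upper()
    return [i for i in range(len(dna) - len(site) + 1)
            if dna[i:i + len(site)] == site]


def digest(dna: str, site: str, cut_offset: int) -> list[str]:
    """Cut the top strand cut_offset bases into each site (assumed offset)."""
    dna = dna.upper()
    fragments, start = [], 0
    for pos in find_sites(dna, site):
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
    fragments.append(dna[start:])
    return fragments


print(is_palindromic("GGATCC"))                        # True
print(digest("AAGGATCCTTTTGGATCCAA", "GGATCC", 1))
```

A staggered cut of this kind leaves single-stranded overhangs on the opposite strands, which is what makes the resulting ends ‘sticky’.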

The nomenclature follows a convention of an abbre- 
viation of the name of the species and the order in which 
the enzyme was isolated; for example, the first enzyme from E. coli strain
RY13 is called EcoRI. Thousands of different
restriction enzymes are now known, including artificial 
enzymes produced by engineering, many of which are 
available commercially, fostering a molecular biology 
service industry to supply enzymes, cloning vectors and 
other tools (see below). 

The enzymes studied by Arber and Meselson were 
Type I, which had limited utility. In 1970, however,
Hamilton Smith and colleagues isolated the first Type II
restriction enzyme (from Haemophilus influenzae,
HindII), which enabled the reproducible cleavage of
DNA molecules at specific sequences.!715 This was used 
by Kathleen Danna and Daniel Nathans to construct the 
first physical map of a genome, that of simian virus 40 
(SV40), using size separation by gel electrophoresis of 
the resulting fragments, ushering in the era of ‘restriction 
mapping’.!° 


^ Type I restriction enzymes restrict the influx of foreign DNA via hor- 
izontal gene transfer while maintaining sequence-specific methyla- 
tion of host DNA and have the ability to change sequence specificity 
by domain shuffling and rearrangements.'>.'¢ 

* And more recently, Type V, whose cleavage sites are determined by 
guide RNAs (the CRISPR systems; Chapter 12). 
It also brought about the era of recombinant DNA tech-
nology, as restriction fragments from different genomes
could be mixed and joined together. This required 
another innovation — the purification and use of DNA 
‘ligases’ capable of joining complementary or blunt DNA 
ends by the formation of a phosphodiester bond, isolated 
in 1967 by Bernard Weiss and Charles Richardson.?? The 
first recombinant DNA molecule was produced by Paul 
Berg, Bob Symons and colleagues in 1972, who mixed 
EcoRI-cleaved SV40 DNA with a DNA segment con-
taining lambda phage genes and the galactose operon of
E. coli?! but fearing the dangers that might be created, 
declined to introduce these molecules into living cells.?? 

That was left to Cohen, who in 1973-1974, together 
with Annie Chang, Robert Helling, Boyerᵈ and other
colleagues, ligated EcoRI restriction fragments from 
Staphylococcus aureus and the frog Xenopus laevis with 
a plasmid that contained a replication origin, an antibi- 
otic resistance gene (for selection) and was cleaved just 
once (i.e., linearized) by EcoRI. They reintroduced the 
recombined molecules into E. coli using a CaCl₂ (‘trans-
formation’) procedure developed by Cohen.?*-26 These
experiments showed that, to first approximation, genes 
could be successfully exchanged between species by 
human intervention.* 

In 1977, Boyer's laboratory developed the first plas- 
mid vector specifically designed for gene cloning, called 
pBR322, which was small, ~4kb, and had two antibiotic 
resistance genes, one for selection of transformants and 
the other with unique restriction enzyme sites for DNA 
insertion to enable identification of recombinant plasmids 
(Figure 6.1).2 


d Boyer was well aware of the potential, having written a review on 
DNA restriction and modification systems the year before.?? 

* These advances led to the famous Asilomar Conference on 
Recombinant DNA technology in 1975, which "placed scientific 
research more into the public domain, and can be seen as applying 
a version of the “precautionary principle’ via an initial voluntary 
moratorium and then strict controls on recombinant DNA con- 
struction and the release of genetically modified organisms into the 
environment", with “one felicitous outcome [being] the increased 
public interest in biomedical research and molecular genetics .. [and 
stimulation of] knowledgeable public discussion some of the social, 
political, and environmental issues that are and will be emerging 
from genetic medicine and the use of genetically modified plants in 
agriculture”.272 The participation of the public in the implications, 
applications, ‘ethical’ considerations and prescribed limits of genetic 
technologies was to be revived again with the later advent of tech- 
niques for precise engineering of animal and plant genomes (Chapter 
12). Many genome research programs, notably those funded by 
Genome Canada, have required a proportion of the funding to be 
allocated to the social, ethical, economic, environmental and legal 
aspects of the work. 
More sophisticated cloning vectors were developed,
notably by Joachim Messing and colleagues from bacterio- 
phage M13, containing multiple clustered sites (MCS) for 
restriction endonuclease cleavage, whereby double diges- 
tion prevents self-ligation and only allows re-circularization 
with a compatible insert.*! Later versions contained an MCS 
within the lacZ gene, which allowed identification of colo-
nies containing recombinant plasmids based on colorimet-
ric detection of the encoded enzyme (beta-galactosidase) 
activity (blue colonies) or lack thereof (white, disrupted by 
an insert), and the ability to isolate single-stranded forms 
to aid DNA sequencing.*°-? Cloning sites were also added 
into other genes that enabled direct (positive) selection of 
recombinant clones (e.g., ??). 
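
Why double digestion prevents self-ligation can be illustrated with a toy model of sticky ends. The sketch below is a simplification under stated assumptions: each end is reduced to its 5' overhang, two ends are taken to anneal when one overhang is the reverse complement of the other, and the overhangs AATT and GATC (those left by EcoRI and BamHI) are example choices that do not come from the text. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def ends_compatible(a: str, b: str) -> bool:
    """Two 5' overhangs can anneal if one is the reverse complement of the other."""
    return a.upper() == revcomp(b)


def can_self_ligate(vector_ends: tuple[str, str]) -> bool:
    """A linearized vector re-circularizes alone only if its own two ends match."""
    left, right = vector_ends
    return ends_compatible(left, right)


def insert_fits(vector_ends: tuple[str, str], insert_ends: tuple[str, str]) -> bool:
    """The insert closes the circle if each of its ends matches the adjacent vector end."""
    v_left, v_right = vector_ends
    i_left, i_right = insert_ends
    return ends_compatible(v_right, i_left) and ends_compatible(i_right, v_left)


single_digest = ("AATT", "AATT")   # vector cut with one enzyme (assumed overhangs)
double_digest = ("AATT", "GATC")   # vector cut with two different enzymes

print(can_self_ligate(single_digest))                 # True
print(can_self_ligate(double_digest))                 # False
print(insert_fits(double_digest, ("GATC", "AATT")))   # True
print(insert_fits(double_digest, ("AATT", "GATC")))   # False
```

In this simplified picture the doubly digested vector cannot close on itself, and the insert can only be accepted in one orientation.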

Many other variations and elaborations were then, and 
still are, being developed on the core requirements of a 
‘vector’ (plasmid or virus) capable of being replicated
in a desired host,ᶠ a selectable marker (usually an antibi-
otic resistance gene or metabolic enzyme to complement
a deficiency in the host) to discriminate transformants
from non-transformants,ᵍ a restriction (‘cloning’) site (or
battery thereof) to insert foreign DNA, a means of favor-
ingʰ or discriminating recombinant clones from those
containing vector alone and a means of identifying the
desired insert (the target gene to be cloned)ⁱ among the
many others that may be produced from restriction endo- 
nuclease digestion of the input DNA. 

Because the production of an encoded protein was 
an important scientific and commercial objective, many 
host-vector ‘expression’ systems were developed in the 
following decades to enable the high-level transcription, 
translation and purification of the encoded protein (see, 
e.g., +35), often assembled in gene cloning protocol man-
uals that became ubiquitous in this period.ʲ


ENABLING TECHNOLOGIES 


The practical potential of the technology was realized 
immediately by Cohen and Boyer, who patented their 
method and started the first of the new generation of 


f That is, having an origin of replication that is recognized by the host 
cell. 

g DNA transformation by CaCl₂ treatment of cells is inefficient, -107*
at best; more efficient methods were developed later.

h By using two different restriction enzymes so that the vector could
not be re-joined without an insert.

i Cloning genes that would complement a deficiency in the host cell,
usually bacteria or yeast, was relatively straightforward.

j Two of the most popular have been ‘Molecular Cloning: A Laboratory
Manual’* and ‘Current Protocols in Molecular Biology’, both regu-
larly updated in new editions or volumes (see, e.g., 31).
FIGURE 6.1 (a) The procedure used by Herb Boyer and colleagues to produce the plasmid cloning vector pBR322 containing
unique restriction endonuclease sites with its two antibiotic resistance genes, such that recombinant plasmids may be identified by
the insertional loss of one or the other. (Reproduced from Bolivar et al.” with permission of Elsevier.) (b) pBR322 was the precur- 
sor to more sophisticated cloning vectors such as pUC19% containing multiple clustered unique sites for restriction endonuclease 
cleavage and strategies to physically favor or genetically select for recombinant molecules. pUC19 image (https://www.addgene.org/50005/)
generated by SnapGene software (snapgene.com). (Courtesy of Addgene.)


biotech companies, DNAX and Genentech, respectively,
while Berg was a major proponent of strict regulation. 
The initial targets were genes encoding medically 
important hormones, such as insulin, growth hormone 
and erythropoietin, among others. The difficulty was 
to identify the rare bacterial clone that contained the 
desired gene, a needle-in-a-haystack problem, especially 
when dealing with large genomes. This was approached 
via RNA, as it was reasoned that the tissues in which 
the proteins are highly expressed would be an enriched 
source of the corresponding (mRNA) coding sequences — 
a fortuitous approach given the (at that time) unknown 
problem that the protein-coding sequences of most genes 
are not contiguous in complex organisms (Chapter 7). To 
do this, additional technologies had to be developed. 
The first was complementary DNA (‘cDNA’) synthe- 
sis — conversion of mRNAs to a DNA equivalent. This 
was achieved using reverse transcriptase, discovered a 
couple of years earlier by David Baltimore? and Howard 
Temin,* with an oligo dT primer that would anneal to the 
polyA tail of mRNAs to initiate synthesis of complemen- 
tary DNA strands, then conversion of the single-stranded 
copies into double-stranded DNA with DNA polymerase. 
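
The two steps of cDNA synthesis can be sketched at the level of sequence alone. This is a minimal illustration, assuming an invented mRNA and a hypothetical minimum tail length for oligo dT priming; it ignores enzyme kinetics, priming of the second strand and everything else about the real reaction.

```python
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def first_strand_cdna(mrna: str, min_tail: int = 10) -> str:
    """Reverse transcription primed by oligo dT on the poly(A) tail.

    Returns the DNA strand complementary to the mRNA, written 5' to 3'.
    min_tail is an assumed threshold for the primer to anneal.
    """
    mrna = mrna.upper()
    if not mrna.endswith("A" * min_tail):
        raise ValueError("no poly(A) tail for the oligo dT primer to anneal to")
    return mrna.translate(RNA_TO_DNA_COMPLEMENT)[::-1]


def second_strand(first_strand: str) -> str:
    """DNA polymerase copies the first strand; result written 5' to 3'."""
    return first_strand.upper().translate(DNA_COMPLEMENT)[::-1]


mrna = "AUGGCCUAA" + "A" * 12          # invented message with a poly(A) tail
first = first_strand_cdna(mrna)
second = second_strand(first)
print(first)    # begins with the oligo dT stretch
print(second)   # the mRNA sequence with T in place of U
```

The second strand carries the same sequence as the original message, which is why a cDNA clone can stand in for the mRNA from which it was copied.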


The second was the synthesis of specific DNA (and 
later RNA) sequences, based on phosphonate, phos- 
phodiester, phosphite triester and phosphoramidite 
anhydrous chemical synthesis methods pioneered in 
the 1950s, 1960s and 1970s by Alexander Todd,*+-46 
Har Gobind Khorana,” Robert Letsinger^?^? and Colin 
Reese,???! among others, and adapted to solid phase syn- 
thesis by Marvin Caruthers and colleagues in 1981.25 
These developments enabled the automation of oligonu- 
cleotide synthesis, as well as the incorporation of both 
natural and non-natural bases, novel linkages such as 
phosphorothioate or peptidyl bonds to improve biologi- 
cal stability or interaction strength, and other additions 
such as biotin for oligonucleotide capture (for reviews 
and recent developments see ?!547), a vibrant domain of 
the biotechnology industry. 

Synthetic designed oligonucleotides have become an 
indispensable part of the toolkit for molecular biology 
research and genetic engineering. Their uses encompass 
not only the detection of corresponding DNA or RNA 
sequences by hybridization, including highly parallel 
microarrays and bead arrays for target quantification 
FIGURE 6.2 Autoradiographs of radiolabeled ColE1 plasmid RNA hybridized to lysed E. coli on nitrocellulose filters contain-
ing or lacking the plasmid at different ratios (B) 1:100 and (C) 1:1. (Reproduced from Grunstein and Hogness” with permission.) 


and capture,‘ but also primers for DNA sequencing and 
amplification, introduction of restriction sites and muta- 
tions for genetic and protein engineering, construction 
of hybrid and other forms of artificial genes, mutagenic 
screening, production of antisense sequences to block 
gene expression, gene therapy, large scale genome 
engineering? and many others. Much later enzymatic 
methods would be developed, using engineered terminal 
transferase,%%6! which make possible the production of 
longer DNA sequences for synthetic biology and even the 
prospect of using DNA for data storage. 

The third technology was DNA, RNA and protein 
‘blotting’, first developed by Ed Southern in 1975 using 
radioactively (and later biotin) labeled probes (usually 
cDNAs) to detect the location of corresponding genomic 
sequences in a restriction digest displayed by electro- 
phoresis (the eponymous ‘Southern blot’), which 
played an important role in the discovery of ‘genes-in-
pieces’ (Chapter 7). The RNA equivalent (‘Northern
blot’) was developed by James Alwine, David Kemp
and George Stark in 1977,% and the protein equivalent
(‘Western blot’)? by Harry Towbin and colleagues in
1979 using labeled antibodies or other ligands to detect
specific proteins in electrophoretic displays of cellular
contents or fractions.9-97 Subsequent variations were
‘Southwestern blots’ and ‘Northwestern blots’® to
detect DNAs and RNAs bound by specific proteins, 
respectively. 

These were early days: the technology was in its
infancy and the cloning of specific genes was a major 
challenge, taking months and sometimes years. A typical 
strategy was to construct a cDNA “library” from a tissue 


* Including sequence capture for ‘exome’ sequencing (Chapter 11) and 
targeted RNA sequencing (Chapter 13). 


known to express the gene of interest, often involving 
size fractionation to enrich the desired mRNA (moni- 
tored by in vitro translation and Western blotting of the 
products), insertion of the cDNAs into a phage or plasmid 
vector and transformation into a bacterial host, usually 
E. coli, then screening for the desired clones among the 
tens of thousands of transformants by colony hybridiza- 
tion using radiolabeled oligonucleotide probes, developed 
by Michael Grunstein and David Hogness in 1975” (Figure 6.2),
commonly designed to be specific for a subsequence of 
the encoded protein with minimal codon redundancy. 
Those involved at the time can attest to the considerable 
celebrations that followed the successful cloning of a 
desired gene. 
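
Choosing a stretch of protein sequence ‘with minimal codon redundancy’ amounts to finding the window of amino acids that can be encoded by the fewest different nucleotide sequences. The sketch below uses the codon counts of the standard genetic code; the peptide, the window length of six residues and the function names are invented for illustration.

```python
import math

# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "M": 1, "W": 1,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}


def degeneracy(peptide: str) -> int:
    """Number of distinct nucleotide sequences that could encode the peptide."""
    return math.prod(CODON_COUNT[aa] for aa in peptide.upper())


def best_probe_window(peptide: str, length: int = 6) -> tuple[int, str, int]:
    """Return (start, window, degeneracy) for the least redundant window."""
    peptide = peptide.upper()
    starts = range(len(peptide) - length + 1)
    best = min(starts, key=lambda i: degeneracy(peptide[i:i + length]))
    window = peptide[best:best + length]
    return best, window, degeneracy(window)


# Invented peptide used only to exercise the function.
print(best_probe_window("LSRMWKDEFA"))
```

Windows rich in methionine and tryptophan, each with a single codon, need the smallest mixture of oligonucleotides, whereas leucine, arginine and serine, with six codons each, are best avoided.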

The revolutionary advance was that individual genes 
and genomic segments could now be isolated, amplified 
and characterized. 


DNA SEQUENCING 


The other technology was DNA sequencing, required to 
verify the identity and understand the details of the cloned 
gene. The first methods were developed in the late 1960s 
by George Brownlee, Fred Sanger and Bart Barrell, who 
used a paper fractionation method to sequence the 120 nt
5S rRNA from E. coli? and by Ray Wu and colleagues 
who used a primer extension approach (copying the 
sequence in vitro using DNA polymerase) to sequence 
the ends of phage lambda.”-? The first complete gene 
(encoding the MS2 RNA phage coat protein) and com- 
plete genome sequences (of phage MS2) were in fact 
RNA sequences, achieved by Walter Fiers and colleagues 
in 1972” and 1976,” respectively, using two-dimensional 
electrophoresis after partial nuclease digestion of the 
phage RNA. 
In 1977, two new and more generalizable methods
for DNA sequencing were published, made possible and 
widely applicable by the large amounts of cloned DNA. 
The first, by Gilbert and Allan Maxam, used terminal 
radiolabeling followed by base-specific partial chemical 
cleavage and size separation of the resulting set of frag- 
ments by (one-dimensional) electrophoresis, visualized 
by radiography.’ 

The second, developed by Sanger! and colleagues, 
extended Wu's primer extension method to produce the 
first sequence of a DNA genome (that of bacteriophage 
ΦX174) using chain terminating dideoxynucleotide ana-
logs, terminal radiolabeling and size separation of the
resulting set of fragments by electrophoresis.?*? Sanger 
sequencing (as it became known) quickly overtook the 
Maxam-Gilbert cleavage method, as it was easier to 
implement and more scalable (Figure 6.3). 
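
The logic of reading a sequence from chain-terminated fragments can be shown with an idealized sketch. It assumes one reaction per dideoxynucleotide, termination at every possible position, perfect size resolution, and fragment lengths counted from the end of the primer; the template and the function names are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = "ACGT"


def copy_strand(template: str) -> str:
    """Strand made by the polymerase from the template, written 5' to 3'."""
    return template.upper().translate(COMPLEMENT)[::-1]


def termination_lanes(template: str) -> dict[str, list[int]]:
    """For each dideoxy base, the lengths at which synthesis can stop."""
    new_strand = copy_strand(template)
    return {base: [i + 1 for i, x in enumerate(new_strand) if x == base]
            for base in BASES}


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the bases in order of increasing fragment length."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


lanes = termination_lanes("TACGGTCA")   # invented template
print(lanes)
print(read_gel(lanes))                  # equals copy_strand("TACGGTCA")
```

Each fragment length appears in exactly one lane, so sorting the fragments by size across the four lanes recovers the newly synthesized strand base by base.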

Incremental technical improvements were made, 
which increased the length of the sequence reads. The 
next big leap forward was the introduction of fluores- 
cently labeled primers by Leroy Hood and colleagues in 
1986*! and chain terminators by James Prober and col- 
leagues in 1987,? which led to the development of the 
first automated DNA sequencer by Lloyd Smith in the 
same year, using a repertoire of labels that allowed all 
four base-specific chain termination events to be identi-
fied in a single reaction and read by continuous electropho-
resis past a photodetector, with the data directly analyzed
by a linked computer (Figure 6.4). The later development 
of ‘sequencing by synthesis’ (SBS) using reversible 
chain terminators in high density on solid phase surfaces, 
resulted in another step change in the volume of data and a 
reciprocal decrease in cost and enabled the industrializa- 
tion and massive parallelization of DNA sequencing that 
led to the genome projects at the turn of the century and 
ultimately to the feasibility of personal genome sequenc- 
ing for precision healthcare? (Chapters 10 and 11). 


THE GOLD RUSHES 


These new technologies led to a stampede in the late 1970s 
and following years to clone and sequence genes or cDNAs 
encoding proteins of interest from bacteria, archaea, fungi, 
plants and animals, the ease of which depended on the 
availability of suitable genetic complementation (for genes 


! Sanger was one of the few people to be awarded two science Nobel 
Prizes in the same category (Chemistry), for protein sequencing and 
DNA sequencing, the other being John Bardeen (Physics) for devel- 
oping the transistor and superconductor theory. The only other dual 
winners were Marie Skłodowska-Curie (Physics and Chemistry) and
Linus Pauling (Chemistry and Peace). 
in microorganisms) or tissues in multicellular organisms
where the gene was highly expressed. For the latter rea- 
son, the first vertebrate cDNAs to be isolated were those 
encoding hemoglobin," immunoglobulins, the chicken 
egg protein ovalbumin, and highly expressed muscle and 
milk proteins. They were followed by many others as the 
technology developed and became adopted across a wider 
spectrum of biological and biomedical disciplines, not 
just biochemistry and genetics, but also botany, zoology, 
microbiology, developmental biology, physiology, phar- 
macology, cell biology, pathology, anthropology, evolu- 
tionary biology, cancer biology, etc. 

Importantly, molecular biology connected plant and 
animal developmental and behavioral genetics with bio- 
chemistry. Many of the genes that affect phenotype in 
model organisms and others in other species were cloned 
and sequenced, leading to an explosion of discovery 
and characterization of whole new families of proteins 
involved in body plan specification, cell differentiation 
and cell biology. Many of these genes turned out to be 
similar from yeast and invertebrates to humans. 

There was scientific gold to be unearthed everywhere.
Investigators built their careers on the discovery of an 
important gene and study of its associated biology, all 
the better if the encoded protein may have medical sig- 
nificance." This in turn reinforced the orthodox view of 
genetic information and fostered a generation of molecu- 
lar biologists occupied with the "brutal reductionism" of 
identifying and characterizing genes encoding proteins.” 

Nonetheless, there were wonderful discoveries in the 
decades that led up to the genome sequencing projects 
at the turn of the century. One could — as many have — 
fill a book on these alone: genes controlling cell division 
or enabling host colonization by bacteria; genes con- 
trolling flowering in plants; genes encoding molecular 
machines, all the way from ribosomal proteins to chlo- 
roplasts to flagella and muscle fibers; genes forming the 
cytoskeleton, and those encoding histones and histone 
modifiers, etc. 

The avalanche of protein sequences (deduced from 
cloned genes) also led to the recognition of similar 
functional modules in different proteins, such as pro- 
tease, phosphorylation, methylation, nucleotide binding 
and DNA binding domains, nuclear and mitochondrial 
localization signals, secretion signals, etc., informa- 
tion about which is now housed in databases such as 
Pfam.?594 


m This is a source of unconscious bias by investigators and may 
explain in part why many biomedical studies have proven difficult to 
reproduce.55-?! 
FIGURE 6.3 One of the autoradiographs presented by Sanger and colleagues in their 1977 paper on DNA sequencing,% showing
electrophoretic size separation of ΦX174 DNA sequences copied in vitro from specific primers using radiolabeled nucleotide triphosphates,
with each of the four separate reactions containing specific A, G, C or T chain-terminating nucleotide analogs (ddATP,
ddGTP, ddCTP and ddTTP). The nucleotide sequence is read bottom to top (5'→3') from the ascending fragments in the different
tracks. Later refinements optimized the reaction conditions, including the ratios of ddNTPs to dNTPs and the use of radiolabeled 
primers to yield even labeling. (Reproduced with author permission from Sanger et al.50) 
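
The logic of reading such a gel can be sketched in a few lines of Python. This is an illustrative toy, not part of Sanger's published method or of any sequencing software: the function name `read_sanger_gel` and the input format (a mapping from each terminator lane to the fragment lengths seen in that lane) are assumptions made for the example. Because every fragment ends at the base named by its lane, ordering all fragments from shortest (bottom of the gel) to longest (top) spells out the newly synthesized strand 5'→3'.

```python
def read_sanger_gel(lanes):
    """Reconstruct a sequence from four chain-terminator lanes.

    lanes: dict mapping a terminator base ('A', 'C', 'G' or 'T') to the
    fragment lengths observed in that lane (hypothetical input format).
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            # Each length should appear in exactly one lane.
            if length in base_at_length:
                raise ValueError(f"fragment length {length} seen in two lanes")
            base_at_length[length] = base
    # Shortest fragments run furthest, so ascending length = bottom to top.
    return "".join(base_at_length[n] for n in sorted(base_at_length))


# Toy gel: band positions are invented for illustration.
toy_gel = {"A": [1, 5], "C": [3], "G": [2, 6], "T": [4]}
print(read_sanger_gel(toy_gel))  # AGCTAG
```

The duplicate-length check stands in for the practical requirement that bands in different tracks be resolvable from one another; real gels also need judgment about faint or compressed bands, which this sketch ignores.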
FIGURE 6.4 An example of the output of an automated fluorescent DNA sequencer. (Reproduced from Foret et al.5 with permission of Elsevier.)


HOX GENES 


The cloning in the 1980s by Walter Gehring and others 
of the genes in the Drosophila bithorax complex - studied 
by Lewis, Hogness" and others - identified the ‘homeo- 
tic’ proteins and the core ‘homeobox’ domain,8-1% as 
well as many others identified by genetic screens, which 
enabled their expression patterns during development 
to be monitored. Unexpectedly, work from Mike Akam 
and colleagues revealed that the regulatory loci in the 
bithorax complex did not encode proteins, but rather 
expressed non-protein-coding RNAs,!0!-1% but these were
overlooked in favor of the homeotic proteins and the pre- 
conception that regulatory regions functioned in cis by 
binding regulatory proteins. 


n Anticipating the first successful cloning of eukaryotic DNA, in 1972
Hogness proposed using large insert clones to enable the detailed 
study of chromosome structure. His laboratory generated the first 
random clones from any organism in 1974, mapped a cloned DNA 
segment to a specific chromosomal location a few months later, and 
by early 1975, had generated clone libraries encompassing the entire 
Drosophila genome. In the late 1970s and early 1980s, Hogness, 
Lewis, Welcome Bender and colleagues achieved the first ‘positional
cloning’ of a gene, Ultrabithorax (Ubx), and then others, using
chromosomal “walking” and “jumping” aided by inversions. Many 
of the mutant alleles in the loci studied turned out to be the result of 
chromosomal breakage or transposon insertions rather than altera- 
tions to protein-coding sequences,’ contrasting with the spectrum 
of chemical mutagen-induced single base changes used widely in 
mammalian genetic studies. 


There was also a strong emphasis on finding equiva- 
lents of genes identified in model organisms (includ- 
ing, for example, neurological and transporter proteins) 
in other species by sequence homology, using initially 
a Southern blot variation dubbed 'zoo blots' and later 
sequence similarity.?210%-106 Such approaches led to the 
discovery that not only do homeotic gene clusters occur 
in vertebrates in multiple copies but also that their introns 
(Chapter 7), relative orientation and temporal expression 
patterns (including antisense transcripts) are conserved 
between Drosophila and mammals!%-1!!(Figure 6.5). 

Many other genes involved in Drosophila development 
were found to have human orthologs, a great surprise at 
the time, including that encoding the homeobox-contain- 
ing protein Pax6, which is required for eye morphogen- 
esis in both insects and mammals, indicating a common 
evolutionary origin? despite the differences between com- 
pound and camera-type eyes.!% They also included many 
mutated in human diseases, like the homolog of the fly 
gene patched in the etiology of the skin cancer basal cell 
carcinoma? and the brain tumor medulloblastoma.!!?.1? 


o The evolution of the eye has been a popular and controversial topic
in evolutionary biology and is often cited as an example of ‘intelligent
design’.

p Tracked down by what is termed ‘positional cloning’ (also referred to
as ‘forward genetics’), whereby genetic and physical mapping tech- 
niques are combined to home in on the chromosomal locations of 
mutations causing serious genetic disorders (Chapter 11). 
FIGURE 6.5 Schematic representation of the correlation between the Drosophila homeotic gene complexes and the murine
(vertebrate) Hox gene network. (Reproduced from Duboule and Dollé!” with permission from John Wiley and Sons.) The upper 
part represents the domains of expression of Drosophila homeotic genes in the embryonic central nervous system (CNS). In the 
central part all the genes belonging to the same subfamily are indicated by the vertical open or closed rectangles, the latter being 
the Hox loci that had been studied by comparative in situ hybridization experiments and whose expression domains had been 
defined at that time. The bottom part schematically represents the antero-posterior boundaries of expression of these genes along 
the fetal CNS and pre-vertebral column. 
ONCOGENES AND TUMOR SUPPRESSORS


Many genes that play a role in the etiology of cancer? — 
‘oncogenes’ (in which mutations and activation drive 
cancer) and ‘tumor suppressors’ (whose inactivation
facilitates cancer) — were unearthed in those years and 
are still being identified today. 

The first oncogene was identified by a brilliant experi- 
ment which sought to understand how the Rous sarcoma" 
virus (RSV) transforms avian cells into a cancerous state. 
RSV was discovered in 1911 by Francis Peyton Rous, 
who showed that this retrovirus, from which Baltimore 
and Temin later isolated reverse transcriptase, was the 
infectious agent present in cell-free extracts of chicken 
tumors that could transmit cancer to other birds,!!® con- 
sistent with observations of others in leukemia!" and 
sarcoma.!!? The surprising finding that cancer could be 
caused by a virus was, as so often the case, not believed 
and was “met with reactions ranging from indifference to 
skepticism and outright hostility”,'*” although Rous ulti- 
mately received the Nobel Prize (55 years later) for his 
discovery. Analysis of viral mutants that could replicate 
but not “transform” cells in culture led to the isolation of 
the v-src gene, encoding a protein tyrosine kinase (which 
phosphorylates other proteins). Just as importantly, v-src 
was shown to be a constitutively activated version of a 
normal human gene, the ‘proto-oncogene’ c-SRC, which 
is mutated in many cancers.!!”20 Subsequent studies, ini- 
tially of Burkitt’s Lymphoma in the early 1980s, showed 
that somatic chromosomal translocations involving the 
c-myc gene could create oncogenic hybrids,?! which also 
occurs in the bcr-abl fusion characteristic of chronic 
myeloid leukemia.!?? 

In 1969, Henry Harris showed that fusion of normal 
cells with tumor cells suppressed their tumorigenicity, 
indicating that cells express genes that control cell growth, 
which are lost in cancers. In 1971, Alfred Knudson and 
others studying rare cases of familial retinoblastoma 
hypothesized that the heritability was due to a loss-of- 
function mutation in one copy of a germline gene, fol- 
lowed by a later de novo (somatic) mutation in the other 
allele: the ‘two-hit hypothesis'.?? Nearly 15 years later 
this led to the identification of the first tumor suppres- 
sor gene, encoding the retinoblastoma tumor suppressor 


q Cancer is fundamentally a life-threatening disease of metazoans,
resulting from the reversion of individual cells in complex differenti- 
ated organisms to a primitive, atavistic state,'!* wherein mutations 
disrupt the interactions between ancestral genes that promote cel- 
lular growth and those that control cell division and differentiation 
in multicellular development! (Chapter 15). 

r Sarcoma is the collective name given to cancers of connective tissue.
protein, RB1.124,125 The two-hit hypothesis explained the
relationship between inherited and acquired mutations
in cancer predisposition genes, including TP53, also
referred to as ‘the guardian of the genome’, which was
discovered in 1979 by several groups and is mutated in 
about 50% of all cancers.?9.?7 

Such experiments resolved the controversy about the 
common causes of cancer — it is due to mutations that 
result in the ectopic activation of genes that promote cell 
division"! and the loss of function of other genes that con- 
strain cell division and migration.!'? These mutations can 
be inherited, occur spontaneously or be induced by DNA- 
damaging carcinogens and radiation, a complex land- 
scape that is still being mapped by sequencing of tumor 
genomes in thousands of human cancers (Chapter 11). 

Driven by medical need, the most intensely studied 
genes in the human genome? are those involved in cancer 
and other diseases.'*+ Many altered cancer-causing genes, 
such as the breast cancer predisposition genes BRCA1 and
BRCA2, maintain genome integrity, ^26 with mutations 
leading to the chaotic genomes seen in many cancers. 
Other genes, such as the mismatch repair genes associ- 
ated with Lynch syndrome, cause a high tumor mutation 
burden, creating ‘neo-antigens’!** that can be recognized 
by the immune system.!38-142 


IMMUNOLOGY AND 
MONOCLONAL ANTIBODIES 


Gene cloning also provided molecular insight into the pre- 
viously arcane world of immunology, showing that anti- 
body genes undergo rearrangement and hypermutation to 
generate a wide arsenal of antigen-recognition molecules. 
It was found that cells that express antibodies to foreign 
antigens undergo secondary changes to improve the bind- 
ing of the antibody, as well as clonal selection, ^*^^ as had 
been predicted by Macfarlane Burnet.'? It also led to the
identification of inflammatory molecules (‘cytokines’) 
that excite immune responses and drive autoimmune 
disorders," as well as tangentially to the development 
and production of mouse monoclonal antibodies in culture,
pioneered by Georges Köhler and César Milstein in
1975" (Figure 6.6) and later humanized by Greg Winter 


s Second only to TP53, and the most popular non-human gene, is the
mouse Rosa26, which was identified in 1991 by Philippe Soriano and 
Glenn Friedrich as a locus that is ubiquitously active in mammals,"? 
and subsequently widely used for the construction of transgenic mice 
and other species, as well as transgenic human cells.!%-1% The Rosa 
locus encodes two overlapping non-protein-coding RNAs,!? whose 
functions are presently unknown. 

t Which result in dinucleotide repeat instability.?? 
FIGURE 6.6 Photograph of immortalized cells created by
a fusion between a myeloma cell line and spleen cells from a
mouse immunized with sheep red blood cells, showing individual
clones that secrete ‘monoclonal’ antibodies that lyse sheep
red cells, indicated by the halos. (Reproduced from Köhler and
Milstein! with permission from Springer Nature.)


and colleagues,!* which have proved so efficacious as 
therapeutics for cancer and autoimmune diseases, among 
other conditions and applications. 


BIOTECHNOLOGICAL EXPLOITATION 


Not only did the gene cloning revolution create a vibrant 
technology support industry, it also transformed the phar- 
maceutical industry. The cloning, engineering and high- 
level expression of genes encoding human hormones such 
as insulin (which had previously been isolated from pig 
and cattle pancreas), erythropoietin (used to stimulate red 
cell production after bone marrow transplantation") and 
growth hormone, among others, spawned multibillion- 
dollar products and companies, such as Genentech and 
Amgen. 

Valuable tools were also developed by gene cloning 
and manipulation, notably the green fluorescent protein 
from jellyfish and its variants with different emission 
wavelengths by Osamu Shimomura, Douglas Prasher, 


u And used illegally by athletes, as is growth hormone, to boost oxygen 
carrying capacity and muscle mass, respectively. 
Martin Chalfie and Roger Tsien,” and firefly luciferase
by Marlene DeLuca and colleagues,!5015! which have 
been widely used in cell and developmental biology 
to track the expression of genes fused to these visual 
‘reporters’. 

Many of the genes discovered and characterized dur- 
ing this period were patented for medical or industrial 
use, largely on the basis of the inventiveness of the tech- 
nology and the novel uses claimed for the products of 
these genes, a practice that was later circumscribed.!* 
However, and despite criticism of gene patenting, it had 
the beneficial effect of allowing the development of new 
pharmaceuticals, which require (limited) monopoly 
rights to protect and recover the required massive invest- 
ments in clinical safety and efficacy trials, following the 
tragic teratogenic effects of the anti-nausea drug thalido- 
mide on limb development in embryos.!5 


CELL-FREE DNA AMPLIFICATION 
AND SHOTGUN CLONING 


In 1983, Kary Mullis conceived a brilliantly simple 
strategy to amplify defined segments of DNA in vitro, 
using flanking oligonucleotide primers and DNA poly- 
merases for cyclic, exponential replication of the tar- 
geted sequence, termed ‘Polymerase Chain Reaction’ 
or PCR!*4 (Figure 6.7). The crucial technical advance 
was the use of thermostable DNA polymerases isolated 
from thermophiles such as the bacterium Thermus aquaticus,!* originally identified by
Thomas Brock in hot springs in Yellowstone National 
Park in 1964.15 PCR allowed ultra-sensitive detection 
and amplification of known DNA segments and trans- 
formed gene cloning, genetic engineering and diagnos- 
tic assays for mutations and infectious agents, especially 
viruses. 
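
The exponential character of the reaction can be illustrated with a simple idealized calculation. The sketch below assumes that every template is copied with a fixed per-cycle efficiency (1.0 meaning perfect doubling); the function name `pcr_copies` and the `efficiency` parameter are hypothetical and introduced only for illustration, and real reactions plateau as primers and nucleotides are depleted.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a given number of PCR cycles.

    efficiency is the assumed fraction of templates copied per cycle
    (1.0 = perfect doubling); it is an illustrative parameter only.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    # Each cycle multiplies the template count by (1 + efficiency).
    return initial_copies * (1.0 + efficiency) ** cycles


# One starting molecule, 30 perfect cycles: 2**30, about a billion copies.
print(pcr_copies(1, 30))
# The same run at an assumed 90% efficiency yields far fewer copies.
print(pcr_copies(1, 30, efficiency=0.9))
```

Under these assumptions a single target molecule reaches about 10⁹ copies in 30 cycles, which is why the method can detect vanishingly small amounts of a known sequence.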

However, most genes that had been cloned in those 
days encoded proteins that had been identified biochemi- 
cally or genetically. There were many more, as Craig 
Venter, Mark Adams and colleagues demonstrated in 
1992 when they introduced the concept of ‘shotgun’ 
mRNA cloning and sequencing (‘expressed sequence
tags’) to double the number of known human proteins in
a single publication.!5 There was also great controversy 
when the US National Institutes of Health attempted 
to patent these genes en masse.’ Similar agnostic 
approaches identified thousands of new genes in other 
organisms including plants.!% Importantly, the shotgun 
strategy along with advances in the technology for high- 
throughput DNA sequencing allowed gene discovery on 
an industrial scale!6!-1% and set the foundations for the 
genome projects (Chapter 10). 
FIGURE 6.7 The principle of exponential amplification of targeted DNA sequences by PCR. Image modified from the US National Human Genome Research Institute fact sheet.57


A WORLD OF PROTEINS 


Throughout this period gene cloning and characteriza- 
tion was almost exclusively focused on protein-coding 
sequences, due to a number of intrinsic and mutually 
reinforcing biases: expectational bias that most genes 
encode proteins; perceptual bias due to the strong pheno- 
types of disabling protein-coding mutations that are read- 
ily observed and genomically mapped; a sampling bias, 
as protein-coding genes are generally highly expressed;" 
technical bias due to the use of oligo(dT) priming of 
cDNA synthesis, which favors mRNAs; the difficulty 
of sequencing vast tracts of non-protein-coding DNA, 
and reticence to do so; and the problem of identifying 


v Some early work indicated that polymorphisms in the 5' region
of human protein-coding genes are associated with variations in
gene expression and disorders, such as hemoglobinopathies and
hypertriglyceridemia.165-168

w In general, protein-coding genes are more highly and broadly
expressed than genes that express regulatory RNAs, which show
high cell specificity,!%%! although there are exceptions (Chapter 13).


causative mutations among the many variations in introns 
and ‘intergenic’ sequences. 

The concept of a gene became synonymous with ‘open
reading frames’, reinforcing the presumed equivalence of
gene and protein, which in turn had a major influence 
on the interpretation of the discoveries of the mosaic 
structure of eukaryotic genes and the vast tracts of non- 
protein-coding sequences in animal and plant genomes 
(Chapter 7). 

As observed by Ed Rubin and Lewis: 


Ironically, the success in cloning and study- 
ing individual genes dampened enthusiasm for 
an organized genome project, which was seen 
as unnecessary. Over 1300 genetically characterized
genes—nearly 10% of all the genes in
Drosophila—have been cloned and sequenced by 
individual labs. This is over twice the percentage 
of genes in any other animal for which both the 
loss-of-function phenotype and sequence have 
been determined. Nevertheless, for flies as well 
as other animals, less than a third of genes have 
obvious phenotypes when mutated, emphasizing
the critical importance of genome sequencing as 
a gene discovery method.” 


A very large fraction of discovered proteins in all king- 
doms of life have no known function.!”? 

On the other hand, not only did the genetic and bio- 
chemical approaches used to identify and characterize 
proteins reveal surprising cases of regulatory RNAs 
(Chapters 8, 9, 12 and 13), genome sequencing and 
high-throughput assays later showed that the largest 
class of proteins in the human genome is RNA binding 
proteins. 5.74 
THE C-VALUE ENIGMA


Following Avery's demonstration that DNA is the genetic 
material, cytological and biochemical measurements of 
the amount of cellular DNA showed that species have 
a characteristic DNA content: (termed the ‘C-value’ by 
Hewson Swift!) and that the amount of cellular DNA 
broadly increases with developmental complexity if 
taxa are compared on the basis of their minimal DNA 
content? Related studies during this period used the 
drug colchicine to arrest dividing cells at metaphase,
enabling the complement and size distribution of chro- 
mosomes also to be determined.^ 

However, anomalies were found. In many taxa, the cel- 
lular DNA content of species varies over a wide range: 
some simple protists and plants such as green algae and 
mosses have more DNA than flowering plants; and many 
plants (including onions, a popular example?) and some 
other protozoans such as amoebae have more DNA per 
cell than mammals??? (Figure 7.1). Since it was assumed 
that more complex organisms* require more genetic 
information (and the understanding of gene structure and 
regulation was derived from microbial genetics and bio- 
chemistry studies), these anomalies led to the coining of 
the term ‘C-value paradox’? or ‘C-value enigma’?

There has been, and remains, considerable speculation 
about the significance of the spectrum of DNA content, 
which is often interpreted as evidence of the ability of 
eukaryotes to maintain superfluous DNA.’ Correlations 
were sought and sometimes found with cell size, involv- 
ing an increase in the number of nuclei and/or copies of 
the genome, possibly to support the metabolism of larger 
cells. 


a Measured in picograms per cell, converted to base pairs on the basis
that 1 base pair = 660 daltons, assuming an equimolar amount of the
bases, i.e., 1 pg ≈ 10⁹ base pairs, or 1,000 Mb (1 Gb).
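
The conversion can be checked directly. The following sketch assumes the figure of 660 daltons per base pair given above and uses Avogadro's number; the function and constant names are hypothetical, chosen only for this example. It gives roughly 0.9 × 10⁹ base pairs per picogram, which the rule of thumb rounds to 1 Gb.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MASS_DALTONS = 660.0    # assumed average mass of one base pair (g/mol)


def picograms_to_base_pairs(picograms, bp_mass=BP_MASS_DALTONS):
    """Convert a DNA mass in picograms to an approximate base-pair count."""
    grams = picograms * 1e-12
    moles_of_base_pairs = grams / bp_mass
    return moles_of_base_pairs * AVOGADRO


# Base pairs in 1 pg of DNA under the 660 Da assumption (roughly 0.9e9).
print(f"{picograms_to_base_pairs(1.0):.3e}")
```

A different assumed average mass per base pair shifts the result slightly, which is why published conversion factors differ by a few percent.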

b Only in 1956 was it reported that the correct diploid number of human
chromosomes is 2n = 46,* until which time it had been thought to be 
48, based on studies in the 1920s by Theophilus Painter.? 

c The definition of biological complexity is controversial and susceptible
to pedantry. We define three types of complexity: metabolic
complexity, which is collectively high in microorganisms, and lower 
in plants and animals; developmental complexity, the numbers of 
positionally and functionally distinct cells and structures (Chapter 
15); and cognitive complexity, the ability to process information and 
learn, which is highest in mammals (Chapter 17). 
The inordinately large amounts of DNA in some
species transpired to be due to two factors: polyploidy,
i.e., multiple copies of the genome, which occurs commonly
in plants and sporadically in animals, especially
insects;"?! and lineage-specific expansions of trans- 
poson-derived sequences, notably in some fishes and 
amphibians (especially lungfish? and salamanders?),? 
some clades of arthropods?^?? and cnidarians (hydra),? 
and many plants, where they play a major role in adaptive 
evolution (Chapter 10).* 

The G-value enigma emerged later, when the genome 
projects showed that there is no correlation between 
the number of protein-coding genes and developmen- 
tal complexity (Chapter 10). Genome sequencing also 
showed that the ratio of non-protein-coding to protein- 
coding DNA (which intrinsically corrects for ploidy) 
increases with morphological complexity,?'? suggesting 
that, whatever else may be at play, increased complexity 
is associated with the expansion of regulatory informa- 
tion. This imputation can only be falsified by a down- 
ward exception, i.e., the identification of developmentally 
complex organisms that have little non-coding DNA, of 
which none have been found to date.! 

For decades, however, the notion that “the number 
of distinct protein-coding genes that an organism made 
use of was a valid measure of its complexity" was deeply 
rooted and well accepted.? 


DUPLICATION AND TRANSPOSITION 


The mechanisms by which genomes can be enlarged are 
gene, segmental or whole genome duplication (first proposed
by Susumu Ohno in 1970??) and copy-and-paste
insertion of sequences from external sources or else- 
where in the genome by transposition. That is, the raw 
material for evolutionary innovation is sequence duplica- 
tion and transposition. The former has been documented 


d It should be noted, however, that the smallest amphibian genome is
half the size of the smallest mammalian genome. See
http://www.genomesize.com.

e Some species and cell types increase chromosomal/chromatid copy
number during development. The giant polytene chromosomes of
salivary glands in Drosophila are one example.

f Upward exceptions do not negate the possibility that large amounts 
of regulatory DNA are needed to program the ontogeny of complex 
organisms. 
FIGURE 7.1 The range of haploid genome sizes for the groups of organisms listed. (Adapted from an image by Steven Carr,
in many species, for example, in yeast and at the origin
of the vertebrates, where it is evident that whole genome 
duplication has occurred at some point in their evolution- 
ary history, with some duplicated genes having acquired 
new functions and been retained, whereas those that 
remained redundant were largely lost.3032 Partial genome 
(segmental) duplication is also well documented.*? 

The work of Leslie Gottlieb, Donald Levin and oth- 
ers has shown that genome duplication (‘autopolyploidy’) 
and fusion of genomes between related species (termed 
‘allopolyploidy’) create phenotypic novelty and speciation
— altering patterns of gene expression, physiological 
responses, growth rates, developmental features, reproduc- 
tive outputs, mating systems and ecological tolerances, ^36 
including in Darwin's finches.*’ This led Levin to suggest 
that such nucleotypic effects may “‘propel’ a population
into a new adaptive sphere, perhaps accounting for the dis- 
tribution of polyploids, both auto- and allopolyploids, in 
areas beyond those of the diploid parents”.36 

Transposition is a specialized and highly flexible form 
of sequence relocation or (more commonly) ‘duplication’ 
(multiplication) that mobilizes protein-coding and/or 
regulatory cassettes,5-? which explains its evolutionary 
value and distribution (see below). Transposases are, in 


g Wheat is a familiar example.**


fact, the most abundant and ubiquitous genes in nature. 
Large numbers of various classes of retrotransposed 
sequences occur in multicellular eukaryotes, especially 
plants and animals*+* and the colonization of genomes 
by transposons appears to have occurred in bursts? likely 
associated with major evolutionary adaptations (see, 
e.g., 7246) (Chapter 10). Transposable elements (TEs) are 
diverse and have been widely incorporated into regulatory 
networks in different clades," as predicted by Britten and 
Davidson (Chapter 5), with, for example, most primate- 
specific regulatory sequences having been derived from 
these elements.383 

It is reasonable to assume that genomes contain some 
duplicated or transposed sequences that are in suspen- 
sion between functional exaptation on the one hand and 
degradation or deletion on the other, i.e., have not (yet) 
acquired a useful (new) function nor been lost. It is cur- 
rently difficult, if not impossible, to determine the extent 
of such limbo sequences in any given lineage. One might 
speculate, however, that the more ancient the duplication 
or TE, the more likely it is to have acquired, or already 
have, a useful function that has contributed to its reten- 
tion. One might also speculate that recently acquired 
transposable elements have played a role in phenotypic
diversification, which has now been well documented??48 
(Chapter 10). 


All That Junk 


MUTATIONAL LOAD, NONSENSE 
DNA, NONSENSE RNA 


The problem was that the large genomes of protists, 
plants and animals, and their large numbers of ‘repetitive’
sequences, could not be reconciled with the protein-centric
conception of genetic information.

The population geneticists and evolutionary theorists 
at the time, notably Müller" and Ohno, suggested that, 
since increases in genome sizes in eukaryotes occurred 
by polyploidization, much of the duplicated DNA is 
redundant. They also argued that, if the unique sequences 
(~50% of the genome) in mammals specified structural
(i.e., protein-coding) genes, there would be ~1 million
such genes, which would, by comparison with bacteria, 
impose an unbearable mutational load, the escape from 
which was the prime function of recombination.^95!5 
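
The arithmetic behind the figure of about 1 million genes can be reconstructed as follows. This is a back-of-envelope sketch, not the original authors' calculation: the haploid genome size of 3 × 10⁹ base pairs and the bacterial-scale gene length of 1,500 base pairs are assumptions chosen for illustration, while the 50% unique fraction is taken from the text. The function name is hypothetical.

```python
def implied_gene_count(genome_bp, unique_fraction, bp_per_gene):
    """Genes implied if all unique sequence were packed with genes."""
    return genome_bp * unique_fraction / bp_per_gene


# Assumed inputs: 3e9 bp genome, 50% unique sequence, 1,500 bp per gene.
print(implied_gene_count(3_000_000_000, 0.5, 1_500))  # 1000000.0
```

On these assumptions the estimate lands at about a million genes, the number that the mutational load argument deemed unsustainable.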

Ohno extended this logic to regulatory information, 
speculating that "in order not to be burdened with an 
unbearable mutation load, the necessary increase in the 
number of regulatory systems had to be compensated 
by simplification of each regulatory system. It would 
not be surprising if each mammalian regulatory system 
is shown to have fewer components than the lac-operon 
system of Escherichia coli”. 

Based on these considerations and Haldane's 1957 
“cost of selection” principle, which stated that the num- 
ber of gene loci in a genome is a key determinant of the 
rate of evolution,** Masatoshi Nei concluded in 1969 that, 
given the “high probability of accumulating ... lethal 
mutations in duplicated genomes... it is to be expected 
that higher organisms carry a considerable number of 
nonfunctional genes (nonsense DNA) in their genome" 
and that "higher organisms, including man ... are using 
only a small fraction of the maximum amount of genetic 
information their DNA molecules are able to store”.55%56 
This logic has persisted to the present,” underpinning 
the recent claim, for example, that “the functional fraction
within the human genome cannot exceed 15%”.55

Such early musings were based on the analysis of 
easily discernible simple traits, which constituted the 
majority of genetic studies up until that time and indeed 
until the end of the 20th century. These traits included 


h Interestingly, based on mutational load arguments at the time, Müller
estimated that there would be ~30,000 genes in mammals, repeated 
by Ohno, which turned out (much later) to be surprisingly accurate 
for protein-coding genes.* King and Jukes used similar calculations 
to predict an upper limit of 40,000 essential genes.°° Such consider- 
ations do not apply to regulatory sequences if variations within them 
lead to complex trait variation, shown later by genome-wide associa- 
tion studies (Chapter 11). 




metabolic defects; flower and eye color! and severe 
genetic disorders, which usually result from high-impact 
loss-of-function mutations in protein-coding sequences. 
By and large, they did not take into account that varia- 
tions in regulatory sequences that control quantitative 
traits in complex organisms may be more subtle, although 
they may have a strong influence on complex traits and 
reproductive fitness: this was a huge blind spot. 

In this context, it should be noted that the mathemati- 
cal foundations of quantitative genetics were laid down 
with a very different set of problems in mind — such as the 
prediction of short-term responses to artificial selection — 
which went on to focus on genetic diversity based on 
enzyme polymorphisms,”” again before crucial details of 
the variation in genome sequences and of genome regula- 
tion in complex organisms were known.% 

Incorporating molecular considerations, John Paul 
(1972) stated the alternatives that, considering the exis- 
tence of hnRNAs and the size of mammalian genomes, 
"either that the mutational load argument does not hold 
for eukaryotes or [as concluded by others] that much of 
the DNA in eukaryotes is not informational?! 

He speculated that the more and less compact regions of 
chromosomes differ chemically, in that "modified histones,
modified DNA or extra substances" determined the confor- 
mation of ‘nucleohistone’ (chromatin). He reasoned that non- 
histone proteins would perform this function, with auxiliary 
participation of nascent RNAs. In his model, “address sites” 
in the interbands would be targets for “polyanionic” regula- 
tors, allowing relaxation and transcription of nascent RNA 
that would not only contain an mRNA sequence but would 
also accumulate in these regions, recruiting RNA binding 
proteins and inducing further unwinding of chromatin. 

Paul also used this possible role for nascent transcripts to 
explain the existence of the very large transcriptional units 
(hnRNAs) in animals and plants, which would contain the 
sequence of mRNAs together with redundant sequences 
producing ‘nonsense RNAs’ that perform an ‘unwinding 
role’. Although vaguely defined, this was one of the first 
models that posited RNAs and histone modifications acting 
together to regulate gene expression. This model also pre- 
dicted, as did others (Chapter 5), that these nascent RNAs 
are processed to generate mRNAs, and even suggested the 
existence of sequence signals in the hnRNAs that guide 
the processing into the RNA parts to be degraded in the 
nucleus or to be exported to the cytoplasm.'! 


i The work of Garrod, Cuénot, Beadle and Tatum, Luria and Delbrück,
Lederberg, Benzer, Müller and others on ‘biochemical mutations’ 
that led to the ‘one gene — one enzyme’ hypothesis (Chapter 2). 

i Which highly influenced early geneticists, including R. A. Fisher (a 
founder of the field of Population Genetics) and the Modern Synthesis. 
The issue was summarized by Ed Southern in 1974:
"The outstanding problem presented by eukaryotic DNA
is that of finding a role for these large fractions not used
in coding for proteins or cytoplasmic RNAs".

Speculation was rife. The evolutionary biologist Tom 
Cavalier-Smith wrote in 1978: 


Eukaryote DNA can be divided into genic DNA 
(G-DNA), which codes for proteins (or serves as 
recognition sites for proteins involved in tran- 
scription, replication and recombination), and 
nucleoskeletal DNA (S-DNA) which exists only 
because of its nucleoskeletal role in determining 
the nuclear volume ...!? 


Others suggested the excess non-coding DNA might be
retained for “genome balance”, have some value as a
mutational sponge or buffer, or be a reservoir for
evolutionary innovation.65-68


‘NEUTRAL’ EVOLUTION 


A natural corollary of the idea that much of the genomes 
of plants and animals is not functional is that these 
sequences are evolving ‘neutrally’. In parallel, the grow- 
ing availability of amino acid sequence data revealed that 
protein sequences have been diverging between lineages 
at a relatively constant rate, referred to as the “molecular
clock” by Emile Zuckerkandl and Linus Pauling, or
“genetic equidistance” by Emanuel Margoliash, in the
early 1960s. In 1971, Richard Dickerson showed that
the clock runs at different rates for different proteins
(Figure 7.2), rates later shown to vary by orders of
magnitude, which makes different proteins useful for
measuring evolutionary relationships over different
genetic distances and evolutionary timescales.
The divergence of the sequences of homologous pro- 
teins over time was surprising and, after taking into 
account the frequencies of deleterious mutations and 
mutational load, led Motoo Kimura to propose in 1968 
the neutral theory of molecular evolution, or ‘genetic 
drift’, which posited that “an appreciable fraction” of 
the genome was evolving independently of natural 


* These were the first manifestations of the nascent field of bioinfor- 
matics, pioneered also by Margaret Dayhoff and Richard Eck, who 
introduced the concept of molecular phylogeny, reflecting a prescient 
prediction by Crick a few years earlier, when he stated: “Biologists 
should realise that before long we shall have a subject which might 
be called ‘protein taxonomy’—the study of the amino acid sequences 
of the proteins of an organism and the comparison of them between 
species. It can be argued that these sequences are the most delicate 
expression possible of the phenotype of an organism and that vast 
amounts of evolutionary information may be hidden away within 
them.” 
selection. Like Nei, Kimura was also motivated by
Haldane's argument and the 1970s finding that the
numbers of nucleotide changes observed between humans and
chimpanzees could not be explained by selection, called
"Haldane's Dilemma", tacitly assuming that mutation
is random and not influenced by other mechanisms (see
Chapter 18).

A similar proposal was made by Jack King and Thomas
Jukes in their 1969 article entitled “Non-Darwinian
Evolution”, which extolled the importance of random
genetic changes and genetic drift in evolution. The theory
was refined in 1973 by Kimura's student, Tomoko Ohta,
and later by others, notably Michael Lynch, who
emphasized the importance of nearly neutral ("slightly
deleterious") mutations, whose exposure to selection is dependent
on the size of the interbreeding population.
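
The dependence on population size can be made concrete with a small numerical sketch. The Python below is illustrative only and is not taken from the works cited above. It uses Kimura's diffusion approximation for the fixation probability of a single new mutation in a diploid population, and it assumes additive selection, a census size equal to the effective size, and hypothetical values for the selection coefficient `s` and the population size `n_e`. The function names are invented for this example.

```python
import math


def fixation_probability(s: float, n_e: float) -> float:
    """Probability that a single new mutation is eventually fixed.

    Kimura's diffusion approximation for a diploid population,
    assuming additive selection (coefficient s) and a census size
    equal to the effective size n_e (assumptions of this sketch).
    """
    if s == 0.0:
        return 1.0 / (2.0 * n_e)      # strictly neutral case
    exponent = -4.0 * n_e * s
    if exponent > 700.0:              # exp() would overflow here: the mutation
        return 0.0                    # is so strongly selected against that it
                                      # effectively never fixes
    # (e^{-2s} - 1) / (e^{-4 n_e s} - 1), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def relative_to_neutral(s: float, n_e: float) -> float:
    """Fixation probability divided by the neutral expectation 1/(2*n_e)."""
    return fixation_probability(s, n_e) * 2.0 * n_e


if __name__ == "__main__":
    s = -1e-4  # hypothetical slightly deleterious mutation
    for n_e in (1e3, 1e4, 1e6):
        ratio = relative_to_neutral(s, n_e)
        print(f"n_e={n_e:>9.0f}  ratio to neutral={ratio:.3g}")
```

Under these assumptions, a mutation with s = -0.0001 fixes roughly four-fifths as often as a neutral one when n_e is 1,000, but essentially never when n_e is 1,000,000, which is the sense in which its exposure to selection depends on population size.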

The extension of this logic, posited as the ‘null
hypothesis’, is that highly complex organisms with small
effective population sizes, such as mammals, accumulate
greater loads of transposable elements, larger introns (see
below) and larger intergenic regions, all of which co-vary
inversely with population size, such that large-bodied
species with low population sizes in particular have bloated
genomes and difficulty in purging even slightly deleterious
mutations. Later theoretical studies also concluded,
mainly based on ‘non-conservation’, that alternative tran- 
scription, polyadenylation, RNA modification and RNA 
editing! sites in complex organisms are non-adaptive.?!-95 

Neutral evolution was controversial in evolutionary 
circles, reflecting a long-standing disagreement between 
“classical” and “quantitative” geneticists that simmered 
for decades, although thought to have been resolved by
Fisher's infinitesimal model (Chapter 2). The classical
geneticists viewed the normal state to be a wildtype
(protein-coding) gene with a low frequency of deleteri- 
ous (usually recessive) mutants in the population, influ- 
enced by Mendel’s simple trait segregation in peas and by 
genetic (‘Mendelian’) disorders in humans. On the other 
hand, the quantitative geneticists, mainly working in 
agriculture, citing the abundant variation in quantitative 
traits in crop plants and livestock and the ‘concealed
variability’ revealed by inbreeding experiments, proposed
that many genes have two or more alleles maintained at 
intermediate frequencies in populations by ‘balancing’ 
selection, perhaps influenced by environmental factors.?? 

A renewed debate between the ‘near-neutralists’ and 
‘adaptationists’ ensued following Kimura’s and Ohta’s 


' Despite the fact that RNA editing has expanded greatly and the 
enzymes involved have been subject to strong positive selection in 
the primate lineage (Chapter 17). 
FIGURE 7.2 The molecular clock. Dickerson's graph of the rates of molecular evolution in fibrinopeptides, hemoglobin and
cytochrome c. (Reproduced from Dickerson with permission from Springer Nature.)


papers. The former maintained that genetic drift
accounts for most differences within populations or
between species, whereas the latter credited them to
positive selection for adaptive traits, although as Laurence
Hurst later observed “the two positions are often hard to
discriminate as they make many similar predictions”.
These debates did not often consider that there might
be an important distinction between the genetic
signatures of protein and regulatory variation and were mostly
thought of in terms of binary (wildtype and “defective”)
alleles rather than interconnected networks. Nor did
they take into account the role of transposons in
phenotypic variation (see below; Chapters 5 and 10), positive
selection for reproductive success, or the amount of
information that might be required to organize the
four-dimensional development of multicellular organisms
(Chapter 15). Moreover, nearly neutral genetic drift does
not account for the rapid evolution of animal phyla and
species, such as observed in the Cambrian explosion,
Darwin’s finches and primates, and is at odds
with the whole genome biochemical indices of function
that were revealed later (Chapter 13).

As Mayr observed in 1970:


The day will come when much of population 
genetics will have to be rewritten in terms of 
interaction between regulator and structural 
genes. This will be one more nail in the coffin 
of beanbag genetics. It will lead to a strong rein- 
forcement of the concept that the genotype of the 
individual is a whole and that the genes of a gene 
pool form a unit.110


And Jacob in 1977: 


It seems likely that divergence and specialization 
of mammals, for instance, resulted from mutations 
altering regulatory circuits rather than chemical 
structures. Small changes modifying the distribu- 
tion in time and space of the same structures are 
sufficient to affect deeply the form, the function- 
ing, and the behavior of the final product — the 
adult animal.111


The situation was summarized in 2014 by Karl Niklas: 


Beginning with a series of papers in the early 
20th century and culminating with his book The 
Genetical Theory of Natural Selection, Ronald 
A. Fisher (1930) founded the field of popula- 
tion genetics and designated the gene as the unit 
of stable hereditary transmission between suc- 
cessive generations. This genocentric view of 
inheritance asserted the preeminent importance 
of allele frequency distributions and differential 
reproductive success in evolutionary processes. 
However, it failed to explore alternative origins of 
phenotypic variation. It simply assumed that all 
phenotypic variants result from [protein-coding] 
gene mutations ... Perhaps even more restrictive 
was the additional assumption that the phenotype 
could be mapped directly onto the genotype and 
thus described simply by changes exclusively at 
the level of individual genes or sets of genes.112


Niklas continued: 


This outlook was challenged in the 1970s and 
1980s within a field of study soon to be called 
evolutionary-developmental biology, or simply 
evo-devo, which asserted that evolutionary phe- 
notypic transformations are the result of changes 
in gene expression patterns rather than the 
immediate products of mutations of individual 
genes ... Arguably ... this perspective can be 


? That is not to say, however, that genetic drift is not an important evo- 
lutionary process, and there are likely many passenger or hitchhiker 
sequence variations of subtle effect.” 
traced back to a seminal paper by Britten and
Davidson.112


Richard Lewontin, who developed some of the statistical 
tools for assessing genetic drift and selection (largely from 
studies of electrophoretic variation in proteins in natural 
populations of Drosophila), observed in 1974 that


For many years population genetics was an 
immensely rich and powerful theory with virtu- 
ally no suitable facts on which to operate. It was 
like a complex and exquisite machine, designed to 
process a raw material that no one had succeeded 
in mining... Quite suddenly the situation has 
changed. The mother-lode has been tapped and 
facts in profusion have been poured into the hop- 
pers of this theory machine. And from the other 
end has issued — nothing ... The entire relation- 
ship between the theory and the facts needs to be 
reconsidered.? 


In 1996, Ohta admitted, with respect to nucleotide
substitution patterns, that "all current theoretical models
suffer either from assumptions that are not quite realistic or
from an inability to account readily for all phenomena".


CONSERVATION AND SELECTION 


The concept of neutral evolution led to attempts to define 
a subset of sequences that are evolving neutrally, to mea- 
sure the unconstrained rate of sequence drift, and thereby 
determine which (other) sequences in the genome might 
be evolving more rapidly or slowly under positive or
negative selection, and therefore be functional.

One obvious candidate was the ‘redundant’ (usually 
third) base of synonymous codons, first exposed by the pio- 
neering sequencing in 1983 of 11 cloned alcohol dehydro- 
genase (Adh) genes in natural populations of Drosophila, 
which revealed 43 previously hidden polymorphisms. 
Only one of these polymorphisms altered a codon speci- 
ficity (and resulted in the known electrophoretic variant of 
the protein), implying that nonsynonymous changes have 
phenotypic consequences and are deleterious, whereas 
the others were possibly neutral. However, later
analyses showed that amino acid codon sequences are not
evolving neutrally (as is also the case for many non-cod- 
ing sequences), possibly reflecting selection pressures on 


? These contraforces are hard to disentangle over evolutionary time. 
By definition, any useful variation is subject to positive selection 
until it becomes fixed in the population (appearing initially to have 
evolved rapidly by supplanting the previous sequence). It is then sub- 
ject to negative selection as its loss is disadvantageous, and thereafter 
evolves slowly (Chapters 10 and 11). 
translational efficiency or RNA structure, with
others showing that the genetic code is optimal for encoding
additional information. Later studies showed that
non-coding polymorphisms affect Adh expression and
that Adh variants are selected indirectly.
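
The distinction between synonymous and nonsynonymous changes that underlies these analyses can be illustrated with a short sketch. The Python below is a hedged illustration, not the method of the studies cited: it builds the standard genetic code table and labels each differing codon between two aligned, in-frame coding sequences. The function name `classify_codon_changes` and the example sequences are invented for this illustration and are not Adh data.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ("*" = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(ref: str, alt: str) -> list:
    """Label each differing codon as synonymous or nonsynonymous.

    Assumes ref and alt are aligned, in-frame DNA coding sequences of
    equal length (a simplification made for this sketch).
    Returns tuples of (codon index, ref codon, alt codon, label).
    """
    if len(ref) != len(alt) or len(ref) % 3 != 0:
        raise ValueError("sequences must be aligned and in frame")
    changes = []
    for i in range(0, len(ref), 3):
        r, a = ref[i:i + 3].upper(), alt[i:i + 3].upper()
        if r == a:
            continue
        same_amino_acid = CODON_TABLE[r] == CODON_TABLE[a]
        label = "synonymous" if same_amino_acid else "nonsynonymous"
        changes.append((i // 3, r, a, label))
    return changes


if __name__ == "__main__":
    # Made-up sequences: Met-Ala-Lys versus Met-Ala-Thr.
    for change in classify_codon_changes("ATGGCTAAG", "ATGGCCACG"):
        print(change)
```

In the made-up example, the third-base change GCT to GCC leaves alanine unchanged, whereas AAG to ACG replaces lysine with threonine.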

The field also looked to other sequences, notably 
‘pseudogenes’ (see below) and ancient retrotransposons,
to estimate the rate of neutral evolution, on the question- 
able and likely incorrect assumption that they are non- 
functional (Chapter 10), leading to a vast underestimation 
of the amount of the human genome that is under selec- 
tion (Chapter 11). 

The debate continues, but the concept of neutral
evolution is coming under siege. As concluded recently 
by Andrew Kern and Matthew Hahn: 


The neutral theory was supported by unreliable 
theoretical and empirical evidence from the begin- 
ning, and ... we argue that, with modern data in 
hand, each of the original lines of evidence for the 
neutral theory are now falsified, and that genomes 
are shaped in prominent ways by the direct and 
indirect consequences of natural selection." 


The adherents begged to differ.'8 

Of course, different types of sequences have differ- 
ent structure-function constraints and different selection 
pressures, as is seen within protein-coding sequences 
where the amino acid sequences of active sites are highly 
conserved but associated scaffolding and domain linker 
sequences are quite plastic. Regulatory sequences
are even more plastic;131-135 orthologous promoters
that have no obvious sequence homology direct similar
expression patterns in fish and humans, and less
than 5% of human embryonic stem cell developmental 
‘enhancers’ (Chapter 14) are ‘conserved’ in mouse. 
These regulatory sequences also encompass small regu- 
latory RNAs and vast numbers of tissue- and cell-type 
specific long non-coding RNAs, which seem to be even 
more evolutionarily flexible, with different sequence- 
structure-function constraints and including increasing 
numbers of functionally validated species- and clade- 
restricted RNAs (Chapters 12, 13 and 16). 

It is now well established that adaptive radiation in 
complex organisms, including primates, is mostly due to 
regulatory variation, which may be co-dominant and
therefore immediately visible to selection. Regulatory 
sequences evolve rapidly, mostly (initially at least) under 


? In general, there are exceptions, such as the ultraconserved elements
whose sequences evolved more rapidly than those of proteins during 
tetrapod evolution but are evolving far more slowly in the amniotes 
(birds and mammals; Chapter 10). 
positive selection for changes in morphological and
physiological phenotypes.

It has also been known for some time, confirmed 
later by the genome projects (Chapters 10 and 11), that 
the mutation spectrum varies enormously across the 
genome, which has been rationalized, for example,
as local variation in the underlying mutation rate (due
to regional differences in nucleotide composition) or
the activity of DNA repair enzymes. The alternative
explanation that the vast non-coding regions of plant
and animal, especially mammalian, genomes are under
selection could not be countenanced, both because they
were assumed to be junk and because selection on this
scale appeared impossible under the mathematical models
of selection operating on random mutation in small populations.

A later analysis showed that there are at least seven
different rate classes of sequence evolution in the
mammalian genome, and different rates of sequence evolution
in gene promoters. Others concluded that ~95% of the
human genome is influenced by background selection and 
biased gene conversion, commensurate with the propor- 
tion of the genome that is dynamically transcribed into 
(mainly non-protein-coding) RNA (Chapter 13), while 
observations in natural Arabidopsis accessions show that 
epigenome-associated mutation bias occurs differentially 
across the genome and gene regions, with essential genes 
(in particular gene bodies) subject to stronger purifying 
selection having a lower mutation rate.!51,152 


JUNK DNA 


It did not seem to occur to most at the time, apart from 
McClintock, Britten and Davidson and a few others 
(Chapter 5), that the enormous numbers of “repetitive” 
sequences and nuclear-localized RNAs might play a role 
in plant and animal differentiation and development. 
And since no one could countenance gigabases of
regulatory protein-binding sites, and for all of the other
reasons cited above, Ohno summed up the growing consensus
when in 1972 he wrote about “all that ‘junk’ DNA in our
genome”, arguing that only a fraction of the human DNA
functions as ‘genes’ and that there is “more than 90%
degeneracy contained within our genome”. Ohno's conclusions
were reinforced by the existence of seemingly defective
‘pseudogenes’ (first identified in 1977 and described as


P Recent high-resolution data suggest that transcription factor binding 
sites occupy just 0.2% of the genome.!>? 

3 Non-protein-coding DNA has been called by many names. These
include ‘excess DNA’, ‘surplus, nonessential, degenerate or silent
DNA’, ‘garbage DNA’, ‘non-informational or nonsense DNA’,
‘vestigial DNA’, ‘supplementary DNA’ and ‘incidental DNA’.
‘relics of evolution’), the ‘gene-poor’ and transposable
element-rich heterochromatin, and the extensive intergenic 
regions in intensively studied loci, thought to be genetically 
and transcriptionally silent but later shown to be the sites of 
‘enhancer’ and other regulatory elements that control gene 
expression patterns in development (Chapters 14 and 16). 

Duplications of globin genes were highlighted by 
Ohno and others not only as the source of new functional 
genes, but also of defective pseudogenes with
untranslatable sequences (“recent degenerates”), many of
which have since been shown to have regulatory
functions (Chapters 10 and 13). Notably, the pseudogene in
the human hemoglobin cluster on chromosome 11,
hemoglobin subunit beta pseudogene 1 (HBBP1, or η-globin
pseudogene in primates), was later found to be subject to
strong selection, tissue-specifically expressed, essential 
for erythropoiesis, mutated in a form of thalassemia and 
to regulate the switches of globin gene expression during 
development.!70-175 

Some did take issue. Herb Boyer questioned the calcu- 
lations of the extent of functionality in the genome based 
on the assumption that lethal mutation rates apply to the 
whole genome, noting in the discussion of Ohno's paper 
that “we can only measure what we see”.

Stephen O'Brien wrote shortly afterwards that the 


conclusions (that) indicate that more than 90% of
the eukaryotic genome may be composed of non- 
functional or noninformational ‘junk’ DNA ... 
have not been fundamentally proven; rather they 
are based on simplifying assumptions of ques- 
tionable validity, in some cases contradictory to 
experimental data." 


He challenged the notion that lethal mutation frequency 
was a good metric for gene number and genome func- 
tionality, citing several lines of evidence suggesting that 
mutations only result in lethality in a minority of genes.* 
Noting that hybridization studies indicate the presence 
of a minimum of 300,000 different transcripts of 1 kb or 
more in mouse brain alone, he made the very reasonable 
point that "RNA does not have to be translated to have


* In 1986, John McCarrey and Art Riggs proposed that pseudogenes
might have roles as regulatory switches or “determinator-inhibitor 
pairs’ during development based on antisense relationships, a pre- 
diction validated, at least in part!**-1% (Chapters 9 and 13). 

* Later high-throughput studies showed that a large fraction of protein- 
coding genes in E. coli and yeast do not result in lethality or in easily 
discernible phenotypes when deleted, presumably because labora- 
tory conditions do not recapitulate natural selection for more subtle 
functions or variations in gene expression.!””!7% The same problem, 
of limited phenotypic screens, applies also to animals, especially in 
relation to inter- and intra-species competition, and behavioral and 
cognitive characteristics. 
a function" and that the existence of these transcripts and their
tissue- and developmental stage-specific expression
support their functionality.

Heterochromatin was also widely thought to be inert, 
despite the fact that it is dynamic, and there was, even then, 
considerable evidence of its importance in developmental 
processes. As observed by Spencer Brown in his
incisive 1966 paper: “Our present picture of gene action comes
almost exclusively from microorganisms. It is a verbally
simple one ... the systems controlling gene regulation in
higher organisms probably involve highly complex
mechanisms necessary for developmental integration”. Among
several considerations on the potential roles of heterochro- 
matin, he noted the reports of chromatin associated RNAs 
and pondered regarding the abundant RNAs present exclu- 
sively in the nucleus that “Such observations would make 
sense if the genes in higher organisms were required to 
build complex machinery for their own control”.

Jim Peacock and colleagues pointed out in 1978: 


In recent years it has become clear that specific 
genetic properties are attributable to heterochro- 
matic regions of chromosomes and that the dif- 
ferent segments of heterochromatin in a genome 
may have different properties ... we present data, 
primarily from Drosophila, to show that the het- 
erochromatin of each chromosome has a unique, 
segmental identity, and that DNA sequences in 
heterochromatin have, as do DNA sequences in 
euchromatin, defined patterns of conservation 
and change during evolution. We show that the 
properties discovered in Drosophila apply to other 
eukaryotes, including plants and mammals."5 


Put simply, the use of the frequency of lethal mutations 
by evolutionary theorists, the emphasis on negative selec- 
tion to assess the extent of functionality of the genomes 
of complex organisms, and the assumptions that repeti- 
tive sequences and pseudogenes are non-functional were 
based on only rudimentary knowledge of molecular 
genetic information and were biased by emphasizing 
deleterious mutations over quantitative trait variations. It 
was conceptually primitive, but unfortunately influential. 

Cloning and more advanced genetic mapping tech- 
niques' would later show the majority of mutations that 
cause severe phenotypic consequences in mammals map 
to protein-coding sequences — what might be called 'cata- 
strophic component damage’. However, the vast major- 
ity (~95%) of variations affecting complex traits — with 
few or only subtle effects on viability — occur outside of 


t Such as exome sequencing and genome-wide association studies
(Chapter 11).
protein-coding sequences (Chapter 11). This is largely
invisible to high-fitness-impact and lethality-based mea- 
sures of genetic load but constitutes the more important 
component of phenotypic variability in natural populations. 

However, Mayr, O'Brien and others were swimming 
against the tide. The phrase ‘junk DNA’ entered the pop- 
ular lexicon, uncritically embraced by those — including 
Brenner — who were convinced of the primacy of pro- 
teins in the specification of cell and developmental biology, 
seemingly incurious about what all that non-protein- 
coding DNA might be doing. In fact, as seen below, propo- 
nents of junk DNA explicitly discouraged research into the 
possible roles of non-coding regions of genomes. 


SELFISH DNA 


A logical extension of the junk DNA view was the proposal 
promulgated and popularized by Richard Dawkins in 1976, 
following earlier theorizing by George Williams and
William Hamilton, that DNA sequences have a
propensity to select for their survival, which he termed “the selfish
gene”. Dawkins argued that the selfish gene hypothesis
can explain the fact that “a large fraction of the DNA is
never translated into protein”, stating “The simplest way to 
explain the surplus DNA is to suppose that it is a parasite, 
or at best a harmless but useless passenger, hitching a ride 
in the survival machines created by the other DNA”.181
The concept was extended in 1980 by back-to-back 
papers by Ford Doolittle and Carmen Sapienza, and Leslie 
Orgel and Crick, entitled ‘Selfish genes, the phenotype
paradigm and genome evolution’ and ‘Selfish DNA: the
ultimate parasite’, respectively.155,182,183 As put by the latter:


In summary, then, there is a large amount of evi- 
dence which suggests, but does not prove, that 
much DNA in higher organisms is little better 
than junk. We shall assume, for the rest of this 
article, that this hypothesis is true ... What we 
would stress is that not all selfish DNA is likely 
to become useful. Much of it may have no specific 
function at all. It would be folly in such cases to 
hunt obsessively for one.183


The exemplars of selfish DNA were sequences derived 
from (endogenous) retroviruses, transposons and other 
types of repetitive elements (Figure 7.3), which reinforced 
FIGURE 7.3 Classification of transposable elements and mechanisms of transposition. Class I retrotransposons mobilize via an
RNA intermediate. Class II DNA transposons utilize a DNA intermediate. Autonomous elements encode the enzymatic machinery
necessary for their transposition. Non-autonomous elements typically do not encode proteins but are capable of being mobilized
using the machinery produced by their autonomous counterparts. (Reproduced from Serrato-Capuchina and Matute under
Creative Commons CC BY license.)
the view that these elements are (mainly) genetic hobos,
notwithstanding McClintock's demonstrations that
transposon mobilization changes developmental phenotype
and responses to the environment. The view of
transposons as functionless and/or deleterious parasites has
endured, reinforced by the discovery that they are
restrained by methylation (Chapter 14) and other ‘silencing’ 
mechanisms. Although their discovery earned McClintock 
a Nobel Prize in 1983, it seems that her emphasis and insis- 
tence on them as controlling elements in differentiation 
and development, and the finding that TEs cause inser- 
tional mutations in bacteria, delayed the award.!??.19? 

The role of transposons as mobile cassettes of genetic 
(especially regulatory) information in evolutionary and 
biological processes, and the proportion of transposon- 
derived sequences that may have contributed or acquired 
useful functions in genomes’ was then and is still not 
known, but it suited the zeitgeist to assume that it is low. 
The selfish fattening out by transposons of intergenic 
sequences and introns within genes (see below) then 
became the widely accepted explanation for all that junk, 
and the C-value enigma.92.195.196 


GENES-IN-PIECES


Perhaps the most unexpected discovery in the history of 
molecular biology was that genes in eukaryotes, especially 
developmentally complex eukaryotes, are not co-linear 
with their encoded proteins, but rather are fragmented and 
separated by non-protein-coding ‘intervening’ sequences 
or ‘intragenic regions’, dubbed “introns”.157,197,198 The
fragments of protein-coding sequences (and flanking regulatory
sequences in mRNAs) were reciprocally called ‘exons’. 

In 1975, Darnell and colleagues showed that adeno- 
virus mRNA is derived from a high molecular weight 
precursor. In 1977, Phillip Sharp and Rich Roberts and
their colleagues observed under the electron microscope 
that adenovirus mRNAs do not hybridize contiguously to 
the adenovirus genome, but rather loop out in segments 


They “acquired the anthropomorphic labels of ‘selfish’ and ‘para- 
sitic’ because of their replicative autonomy and potential for genetic 
disruption". 

' It is clear that retroviral and retrotransposon insertions in some 
instances disrupt protein-coding genes,!% but it is also clear that 
some, if not many or most, have far from random genomic- and clade 
distributions and have been exapted to function as “nature's tools for
genetic engineering” (Chapters 10 and 16).

The etymology is exon = EXpressed regiON, coined by Gilbert in 
1978: “The notion of the cistron... must be replaced by that of a tran- 
scription unit containing regions which will be lost from the mature 
messenger — which I suggest we call introns (for intragenic regions) — 
alternating with regions which will be expressed — exons."?? 


(Figure 7.4), indicating that the mRNA is derived from
regions of the genome that are not adjacent, as
confirmed by others.

The same phenomenon was soon reported in
vertebrate genes encoding β-globin, chicken
ovalbumin (Figure 7.5) and lysozyme, and
immunoglobulin light chain, and in ribosomal RNA genes in
Drosophila, whose cloned mRNA (cDNA or
complementary DNA) sequences hybridized to multiple
larger-sized fragments of restriction endonuclease-digested
genomic DNA in Southern blots. This is impossible to
explain unless the cDNA sequences were spread over a
large section of genomic real estate, which was confirmed
by transcript mapping and sequence analysis.

It transpired that the intervening sequences are
‘spliced’ out from the primary transcripts (now called
pre-mRNAs) in the nucleus, by a complex
RNA-guided and catalyzed process (see following
chapter), which explained the previously observed hnRNAs.
The re-assembled ‘mature’ mRNAs are then exported
to the cytoplasm for translation into proteins, so all
was right with the gene-protein worldview, even if it is
stranger than could possibly have been imagined.

The discovery of split genes (or “genes-in-pieces” 
as phrased by Walter Gilbert) and mRNA splicing in
eukaryotic cells was “a complete shock to the scientific
world”, as it broke another fundamental tenet of gene
expression — the concept of collinearity — as "everyone 
assumed that the structure of a gene was a contiguous 
string of base pairs, from which information was trans- 
ferred for synthesis of a protein”. 

The presence of introns interrupting the mRNA 
sequences of eukaryotic genes was immediately and 
universally assumed to be another manifestation of, and 
proffered as further evidence for, junk DNA!55182,183,216 — 
notwithstanding and not considering the obvious alter- 
native that other information may be transmitted by 
the excised non-protein-coding RNA,?" and contempo- 
rary reports of “intron-mediated enhancement" of gene 
expression.?!* That the possibility that introns or intronic 
RNAs contain functional signals was not canvassed at 
the time is testimony to the strength of the belief that 
genetic information is (only) transduced through pro- 
teins, entrenched just 16 years after the lac operon. 

Nonetheless the discovery of introns meant that the 
mystery of mammalian mRNA biogenesis had been 
solved.?? It also helped to explain the vast amount of 
non-coding DNA in the genomes of higher organisms, 
and the existence of hnRNAs and the excess of RNA 
in the nucleus, reconciling the Central Dogma with 
these unusual features of eukaryotic gene expression. 




FIGURE 7.4 Electron micrographs of a hybrid between an adenovirus-derived mRNA and adenovirus DNA, with arrows
showing boundaries of the R-loop of single-stranded DNA that is not present in the mRNA, the first demonstration of the presence 
of introns. (Reproduced with author permission from Berget et al.?°°) 


Crick described introns as “‘nonsense’ stretches of DNA
interspersed within the sense DNA”.?! As put by Ohno,
while bacterial genomes are “small and tidy”, filled
with polycistronic genes, the genomes of vertebrates are
“untidy to the extreme”, with genes spaced very far apart
from each other in such a way that “translation through
so long spacer is out of question ... [and] there was
no choice but to achieve the fusion of adjacent coding
sequences at the post-transcriptional level”.220

FIGURE 7.5 Exon-intron structure of the chicken ovalbumin Y gene. Filled boxes indicate protein-coding sequences, with
unfilled areas indicating 5' and 3' untranslated regions in the mature mRNA. (Reproduced from Heilig et al.?5 with permission
from Oxford University Press.)

It was also assumed that the excised intronic RNA
is quickly degraded and the ribonucleotides recycled,
although the technology of the time was too primitive
to draw this conclusion. Northern blots, which may have
been able to track the fate of the excised RNAs, had just 
been introduced (Chapter 7) and often relied on polyA- 
based purification protocols that neither capture nor 
detect spliced out RNAs.* Sharp and colleagues stated 
(with supporting references) that "introns are excised 
from pre-mRNAs with a half-life of 3 seconds to 30 
seconds”?4 but a year later asserted (without supporting 
references) that excised introns “are rapidly degraded... 
(with) a half-life of ... the order of a few seconds”,?? an
entirely different statement, indicative of the logical slip. 
In fact, intron-specific in situ hybridization showed that 
excised intronic RNAs can be relatively stable and eas- 
ily detectable in the nucleus.?? Later studies showed that 
many functional RNAs are derived from introns (Chapter 
8) and that intronic RNAs (including retained introns
and intron-derived RNAs) constitute the major fraction
of the non-coding RNAs in mammalian cells.?2 

Sweeping introns under the intellectual carpet as 
junk (like the transposon-derived sequences within 
them) still left the question of how the split gene 
arrangement came to be in the first place. Their exis- 
tence was subsequently rationalized by Gilbert as a 
hangover of the primordial assembly of genes from 
fragments of protein-coding information (the ‘introns- 
early hypothesis’).!>7 

Gilbert also predicted that the presence of introns 
would enable ‘alternative’ splicing and thereby the evo- 
lution of modularity in protein-coding genes, expanding 
the repertoire of protein isoforms in complex organ- 
isms!” (Figure 7.6). This proved to be correct,????? 
and was recently shown to include the exonic capture 
of fragments of transposable elements to allow the pro- 
tein to act as a genome-wide transcriptional regulator, 
leading to the conclusion that “TEs interacting within
their host genome provide the raw material to generate
new combinations of functional domains that can be
selected upon and incorporated within the hierarchical
cellular network”.^!
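
The combinatorial effect of alternative splicing on isoform number can be illustrated with a minimal sketch. It assumes the simplest case only, in which each 'cassette' exon is independently either included or skipped while all other exons are always retained; the exon names and the function `enumerate_isoforms` are hypothetical and chosen purely for illustration, and real splicing involves many other event types and regulatory constraints.

```python
from itertools import product


def enumerate_isoforms(exons):
    """Enumerate mature transcripts for an ordered list of exons.

    `exons` is a list of (name, is_cassette) pairs in genomic order.
    Constitutive exons are always kept; each cassette exon may be
    included or skipped independently (a simplifying assumption).
    """
    cassette_positions = [i for i, (_, is_cassette) in enumerate(exons) if is_cassette]
    isoforms = []
    # One True/False inclusion decision per cassette exon.
    for choice in product([True, False], repeat=len(cassette_positions)):
        include = dict(zip(cassette_positions, choice))
        isoform = tuple(
            name for i, (name, _) in enumerate(exons) if include.get(i, True)
        )
        isoforms.append(isoform)
    return isoforms


# Hypothetical four-exon gene with two cassette exons.
gene = [("E1", False), ("E2", True), ("E3", True), ("E4", False)]
for isoform in enumerate_isoforms(gene):
    print("-".join(isoform))
```

Under this assumption a gene with n cassette exons can yield up to 2^n distinct mature transcripts, which conveys why split genes expand the potential repertoire of products from a single locus.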

Gilbert's hypothesis was elaborated independently 
by Darnell, Doolittle?^ and Colin Blake; with the 
sequitur that exons would be predicted to encode pro- 
tein functional units or "smaller, supersecondary struc- 
tures”.235 While there was evidence that some exons 
corresponded to protein domains,?3%-2% it was difficult to 
show that most protein-coding exons comprised modu- 
lar elements of protein structure,?7??? and later studies 


* Other studies showed that at least 40% of all RNAs in human cells
are not polyadenylated (Chapter 13).221-223 


showed that alternative splicing is more common in regu- 
latory sequences in mRNAs and non-coding RNAs than 
in protein-coding exons.?^? 

Developmentally complex organisms have a greater 
number and larger size of introns,?*2% comprising at 
least 40% of the human genome — and likely much more, 
given that there are many distal alternative promoters and 
5' exons expressed in early development, introns in genes 
encoding non-coding RNAs, and many genes enclosed 
within introns of other genes (Chapter 13). 

FIGURE 7.6 Types of alternative splicing. (Reproduced from
Blencowe 2? with permission of Elsevier.)

By contrast, it was argued, the genomes of fast-growing
microorganisms had been streamlined under pressure
for rapid replication, overlooking the fact that developmentally
complex eukaryotes had microbial ancestors for
at least a billion years, which would have been subject to
the same pressures. As Gilbert expressed it: “... introns
were lost in the course of evolution ... [and] only genes in
slowly replicating cells of complex organisms still retain
the full stigmata of their birth”.??? This is, in evolutionary
terms, a non sequitur, but nonetheless was repeated by
others, including Brenner in 1990:


There is a view that E. coli is primitive and we 
are advanced. That is true from the point of view 
of function and action. But it is not true from 
the point of view of genome structure. Here it 
is E. coli that is streamlined and sophisticated, 
whereas it is our genome that has preserved a far 
more primitive condition.?^^ 


Introns were later found to reside in out-of-the- 
way places (non-translated RNA genes) in bacte- 
ria,?? to have self-splicing capability and to be able 
to act as mobile genetic elements?^9 (Chapter 8). 
Cavalier-Smith, Norman Palmer and John 
Logsdon Jr. argued that it was more likely that 
introns entered (by reverse splicing, Chapter 8) 
and expanded in complex organisms late in evolution,216,241,247,248
while not challenging the assumption
that they are devoid of information. This view per- 
sisted despite the examples of conserved sequences 
within them, which if removed or mutated have pheno- 
typic effects (and, as seen later, also encoding distinct 
and stable RNA species).2*9-252 

The prevailing view was summarized by Matt Ridley 
in his 1999 book Genome: The Autobiography of a 
Species in 23 Chapters: 


Each gene is far more complicated than it needs 
to be, it is broken up into many different 'para- 
graphs' (called exons) and in between lie long 
stretches (called introns) of random nonsense 
and repetitive bursts of wholly irrelevant sense, 
some of which contain real genes of a completely 
different (and sinister) kind ... But ninety-seven 
per cent of our genome does not consist of true 
genes at all. It consists of a menagerie of strange 
entities called pseudogenes, retropseudogenes, 
satellites, minisatellites, microsatellites, trans- 
posons and retrotransposons: all collectively 
known as ‘junk DNA’, or sometimes, probably 
more accurately, as ‘selfish DNA’. Some of these 
are genes of a special kind, but most are just 
chunks of DNA that are never transcribed into 
the language of protein.2% 


The presumed irrelevance of the vast tracts of tran- 
scribed non-protein-coding RNAs — “Mother Nature's 
dirty little secret"?5? or “junk in the attic" 6 — became 
accepted as such. 
NOT JUNK?


There was an alternative, equally if not more plausible, 
and far more interesting, possibility canvassed by John 
Mattick in 1994?" i.e., that the separation of transcrip- 
tion from translation in eukaryotes allowed the invasion 
of protein-coding genes’ by introns.^92*^ He posited 
that the evolution of the spliceosome then allowed these 
sequences to explore new genetic space and to acquire 
functions as RNA regulatory signals (or “informational
RNA”, iRNA) expressed in parallel, akin to efference signals
in neurobiology, accounting for the expansion of
these sequences. He also predicted that some, and per- 
haps most, genes in complex organisms express regu- 
latory RNAs and that the evolution of RNA regulatory 
networks was the enabler of the appearance and radiation 
of developmentally complex animals.?" That is, plant and 
animal genomes are not full of non-functional remnants 
of early evolution colonized by parasitic genetic hobos 
but are largely devoted to the specification of regula- 
tory RNAs required for multicellular development?56-266 
(Chapters 12-14 and 16). Later studies showed, inter alia, 
that ‘enhancers’ with tissue-specific activity (Chapters 14 
and 16) are enriched in introns?* and that many small 
regulatory ‘microRNAs’ are derived from introns 
(Chapter 12).268 
8 The Expanding Repertoire of RNA


The biochemical analyses of RNA in the 1960s detected 
many short RNAs in the nucleus and cytoplasm of 
eukaryotic cells? using new techniques of radioactive 
labeling, differential sedimentation and gel electrophore- 
sis, and better procedures for isolating intact RNAs with 
detergents and chaotropic agents! to overcome degrada- 
tion by RNases. 

Initially there were concerns that the small RNAs are 
by-products of the biogenesis or degradation of larger 
RNAs. On the other hand, contrary to the generalization 
that nuclear RNAs are transient, destined for process- 
ing and export to the cytoplasm, the few groups study- 
ing these newly identified low molecular weight RNAs 
reported that they are highly expressed and account for 
~20% of the nuclear RNA in mammalian cells.? They
were also found to differ in size and sequence compo- 
sition from tRNAs and rRNAs, and to be metabolically 
stable? 

At least 10 discrete RNA species were identified, some 
of which contained methylated nucleotides, localized in 
specific subnuclear fractions (the nucleoplasm, chromatin 
or the nucleolus), with others in the cytoplasm.* Many 
of those RNAs in the nucleus were uridine-rich, leading 
to their designation as ‘U RNAs’, numbered in the order 
of their discovery as U1, U2, U3 and so on.* Robert
Weinberg and Sheldon Penman named them "small 
nuclear RNAs" (snRNAs).®!° 

It was also found that RNA polymerase III is responsi- 
ble for the transcription of many of these small RNAs, not 
RNA polymerase II, which transcribes mRNAs (and long 
non-protein-coding RNAs, Chapter 13), indicating that 
different RNA polymerases synthesize different classes 
of RNA.!-? RNA polymerase III products also include 
RNAs originating from repetitive sequences, only later 
characterized, such as those transcribed in human cells 
from Alu elements. 

The characterization of snRNAs in the following 
decades revealed that they had ‘housekeeping’ functions 
in the modification and maturation of rRNAs, tRNAs 
and mRNAs, as well as other functions in gene regula- 
tion and cellular processes such as protein export. These 
years also saw RNAs encroach on the traditional domain 
of proteins, catalysis, which in turn led to a plausible 
explanation of the molecular origin of genetic information,
with RNA at its core.


a Also in bacteria (Chapter 9).


SPLICEOSOMAL RNAs 


Although their roles were unknown, snRNAs were too 
small to function as mRNAs and, being mainly nuclear, 
did not seem to be directly involved in protein synthesis. 
On the other hand, some snRNAs contained sequences 
complementary to hnRNAs. This led Michael Lerner 
and Joan Steitz, and independently John Rogers and 
Randolph Wall, to propose in 1980 that these RNAs 
play a role in RNA splicing,^-" based on earlier work on 
sense-antisense interactions between mRNA and rRNA
sequences in translation initiation.'*!° In particular, the
complementarity of the U1 snRNA sequence to both
the 5’ and 3’ splice site sequences of hnRNAs led to the
hypothesis that splice sites are recognized and aligned
through RNA-RNA interactions between the splice sites
and U1 snRNA.!”
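
Recognition of a target by sense-antisense base pairing can be sketched computationally. The following is a minimal illustration, not a method described in the sources cited here: it searches for the longest contiguous antiparallel duplex between a short guide RNA and a target, optionally tolerating G-U wobble pairs. The function name `longest_antisense_match` and the example sequences are assumptions made for illustration only, and the sketch ignores mismatches, bulges and folding energetics.

```python
# Canonical Watson-Crick pairs, plus the G-U wobble pair often tolerated in RNA duplexes.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def can_pair(a, b, allow_wobble=False):
    """Return True if bases a and b can pair under the chosen rules."""
    return (a, b) in WATSON_CRICK or (allow_wobble and (a, b) in WOBBLE)


def longest_antisense_match(guide, target, allow_wobble=False):
    """Find the longest contiguous antiparallel duplex between guide and target.

    Both sequences are written 5'->3'. Returns (length, guide_start, target_start)
    with 0-based start positions, or (0, -1, -1) if no base can pair.
    """
    rev = guide[::-1]  # reversing the guide turns antiparallel pairing into a diagonal scan
    n, m = len(rev), len(target)
    best = (0, -1, -1)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if can_pair(rev[i - 1], target[j - 1], allow_wobble):
                cur[j] = prev[j - 1] + 1  # extend the paired run along the diagonal
                if cur[j] > best[0]:
                    # Map the run back to 5'->3' guide coordinates.
                    best = (cur[j], n - i, j - cur[j])
        prev = cur
    return best


# Illustrative sequences (assumed, not taken from the text).
guide = "AUACUUACCUG"
target = "CCAGGUAAGUAU"
print(longest_antisense_match(guide, target))
```

In this toy case the whole guide pairs with positions 1 to 11 of the target. The same kind of search, applied across a transcriptome, reflects the reasoning by which short complementary motifs suggested guide roles for small RNAs.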

The characterization of the functions of these RNAs 
in the 1980s was aided by the fortuitous discovery that 
antibodies in the serum of individuals suffering autoim- 
mune disorders, such as lupus erythematosus, precipi- 
tated ribonucleoprotein complexes (RNPs) containing 
snRNAs.!520-23 Several of the snRNAs interacted with a 
common antigen, the Sm (Smith) antigen,> named after 
the first lupus patient in whom such antibodies were 
detected.”°?> Other autoantigens were associated with 
other RNP complexes.? 

These antibodies were used not only to purify the 
complexes but also to block the function of the corresponding
snRNPs in vitro, which showed that splicing
of pre-mRNA is inhibited by targeting the U1 RNP
and therefore that snRNAs are required.2%% Chemical
cross-linking confirmed that U1 and U2 RNAs do,
in fact, base pair to hnRNAs in the nucleus??? and
genetic complementation experiments confirmed that
U1, U2, U4, U5 and U6 snRNAs function in splicing.?>*°
Characterization of the process revealed that large RNPs, 
which came to be known as ‘spliceosomes’, incorporate 
these snRNAs, wherein they interact with each other and
with target pre-mRNAs to guide the process of splicing
(Figure 8.1).25:3033


^ Sm proteins participate in a wide variety of RNA transactions and
predate the Archaea-Eukarya split.?*


FIGURE 8.1 The mechanism of splicing and the complexity of the RNA interactions and structures. (a) The RNA interaction
network before the first trans-esterification reaction. The dotted lines indicate the triplex interactions at the catalytic core. (b) Three-
dimensional structure of the active site RNA in the C complex. Magnesium ions are represented by two yellow spheres located
between the backbone of the catalytic triad and the highly twisted backbone at the bulge in the internal stem loop. (c) Structure of
the 5’ exon and branched intron bound to the active site (overlaid on the structure in b). (d-f) Interaction of the catalytic core of the
spliceosome and movement of the branch helix. RNAs are color-coded as in a. The yellow arrows indicate the active site metals.
(Reproduced from Fica and Nagai*! with permission from Springer Nature.)

Much later, it was found that U1 and U4 snRNAs
also regulate transcriptional initiation, transcript 
structure and chromatin architecture,** and that U2 
snRNA is required for RNA polymerase II pausing,*! 
coupling transcription to splicing, which are inter- 
twined processes (Chapters 14 and 16). It was also 
discovered that there is a minor class of spliceosome, 
which contains specific snRNAs (U11, U12, U4atac
and U6atac) equivalent to but distinct from their coun- 
terparts in the major U2-type spliceosome (U1, U2, U4 
and U6), and which recognizes a rare class of introns 
initially referred to as AT-AC introns, now called U12- 
type introns.?-^ The minor spliceosome is required 
for development and may have particular functions in 
the brain.+-48 


SMALL NUCLEOLAR RNAs 


Other small RNAs had other functions. The highly con- 
served U3 RNA was found to associate with 28S rRNA? 
and to be localized in the nucleolus,!!*° where ribosome 
biogenesis occurs, which led Jean-Pierre Bachellerie to 
suggest in 1983 that U3 and other “small nucleolar RNAs” 
(snoRNAs) participate in this process.°° Subsequently it 
was also found that autoimmune antibodies recognizing 
a nucleolar protein, fibrillarin,? would co-precipitate not 
only U3 but also the less abundant U8 and U13 RNAs. 
Fibrillarin was later found also to bind U16.5

Similar to the observations that led to the elucidation 
of the roles of snRNAs in pre-mRNA splicing, snoRNAs 
were found to have short sequence motifs complementary 
to rRNA sequences, which indicated that snoRNAs were
involved via base pairing in rRNA processing, modification
or other aspects of ribosome biogenesis.^?


* Although they were originally named based on their nucleolar localization
(in contrast to snRNAs), snoRNAs are also found beyond the
nucleolus?! and are secreted from cells.*?


FIGURE 8.2 Schematic representation of C/D (a) and H/ACA (b) box snoRNAs. Consensus box sequences are highlighted in
green. The conserved structures within these snoRNAs guide effector protein complexes that catalyze 2'-O-methylation or
pseudouridylation respectively. (Reproduced from Abel and Rederstorff%* with permission from Elsevier.)

Nevertheless, it was only in 1990 that U3 was shown 
to be essential for rRNA processing,” and in 1996 
that snoRNA U24 directs site-specific methylation of 
rRNAs? with subsequent studies showing that other 
snoRNAs perform similar functions via base pairing 
with target sequences adjacent to modification sites.^!^$ 
Thus, although first identified in the 1960s, it was three 
decades before snoRNAs were defined as a new “class of 
RNAs" with demonstrated functions.?.^? 

SnoRNAs are ~60-300nt in length and are classified
into two families (based on typical sequence motifs and 
structural features) that guide enzyme complexes to per- 
form 2’-O-ribose methylation (‘C/D box’ RNAs) or pseu- 
douridylation (“H/ACA box’ RNAs) respectively of target 
nucleotides (Figure 8.2), not only in rRNAs and tRNAs 
but also in snRNAs.9? Homologs of C/D box and H/ACA 
box snoRNAs occur in archaea, where they also guide 
modifications of tRNAs, indicating that they first evolved 
over 3 billion years ago.°'-® 


There are also many snoRNAs that show tissue- 
specific expression and whose targets are unknown, 
described as “orphan” snoRNAs.9? Some were later 
found also to be involved in RNA processing, including 
the C/D box snoRNAs U8, U14, and U22, as well as H/ 
ACA box snoRNAs snR10, snR30, E2 and E3, which 
direct site-specific cleavage of pre-rRNAs.9 Yet, other 
snoRNAs were shown to regulate alternative splicing 
by base pair recognition® and to guide other modifica- 
tions such as acetylation of specific cytosine residues 
in 18S rRNA.9 

More was to come. In 2001, Beata Jady and Tamas 
Kiss identified a sno-like RNA, U85, containing both H/ 
ACA and C/D box motifs, which is localized in ‘Cajal 
bodies’ (a subnuclear domain associated with nucleoli, 
discovered by Santiago Ramón y Cajal in 190395) and 
guides 2’-O-ribose methylation and pseudouridylation 
of the U5 spliceosomal snRNA.9? The snRNA U7 is 
also localized in Cajal bodies and participates in his- 
tone pre-mRNA 3’ end formation.””! Other small Cajal 
body-specific (‘sca’) RNAs have since been identified, 
some with a composite structure similar to U85, while
others have only H/ACA box or C/D box domains, 
which guide modifications of spliceosomal snRNAs” 
and tRNAs, the latter as part of a stress response. 
The RNA component of the human telomerase complex 
also contains a characteristic H/ACA box scaRNA- 
like structure and also localizes in Cajal bodies.?-7? 
Primate-specific H/ACA snoRNA-like RNAs, called 
AluACA RNAs, are derived from intronic Alu-repeat 
RNAs? and new functions of snoRNAs continue to 
be discovered, such as the maintenance of chromatin 
accessibility.” 

There are over 700 known snoRNA and snoRNA-like 
RNAs encoded in the human genome.5 Most snoRNAs 
are produced by processing of intronic RNAs excised 
from host transcripts,*! commonly those of genes encod- 
ing proteins involved in translation or ribosome biogen- 
esis, including ribosomal proteins, translation factors and 
nucleolar proteins such as fibrillarin,6%%2 the first evidence 
of parallel genetic output. Many other snoRNAs are 
derived from the introns of transcripts that do not encode 
proteins,5%-88 some involving species-specific alternative 
splicing,? and whose primary function in many cases 
is uncertain, although others, like Gas5,% have dem- 
onstrated roles as long regulatory RNAs (Chapters 9 
and 13). Some snoRNAs are expressed exclusively in the 
brain.>!:9!-93 

The complexity of the relationships between small 
RNAs is illustrated by the later discovery that snoRNAs, 
from yeast to humans, are processed to produce three 
subspecies, one of which functions as a microRNA in 
the RNA interference pathway$*9?^9 (Chapter 12). The 
complexity of the networks, and an indication of how 
much is yet to be understood about them, is highlighted, 
for instance, by the observation that a human-specific 
snoRNA is attached to the end of a longer non-coding 
RNA that regulates rRNA biogenesis and nucleolar struc- 
ture (via phase separation, Chapter 16).%% In addition, 
many snoRNAs accumulate in the form of stable lariats 
instead of fully processed snoRNP particles, with as yet 
unknown functions.!% 

Aberrant expression of snoRNAs has been linked 
to human disease. There is a large cluster of C/D box 
snoRNAs in a parentally imprinted gene that is normally 
expressed in the brain from the paternally derived allele,
perturbations of which are associated with Angelman 
and Prader-Willi syndromes.55?1.01.12 One of the snoR- 
NAs in this region, HBII-52, contains an 18 nucleotide 
sequence that is complementary to an exon in the sero- 
tonin receptor 5-HT(2C) mRNA and mediates its alterna- 
tive splicing.’ 
OTHER SMALL GUIDE, SCAFFOLDING
AND REGULATORY RNAs 


7SL and 7SK RNA species, their names reflecting their 
sedimentation coefficient, were identified by Penman 
in 1976.!% These RNAs were initially thought to have a 
viral origin? but were shown to be present in uninfected 
cells.104.105 

7SL RNA is ~300nt in length. It is ubiquitous in 
eukaryotic cells and was found, accidentally,’ to be an
essential component of the protein export ‘signal rec- 
ognition particle’ (SRP),!°° characterized in the 1970s 
and 1980s by Günter Blobel and colleagues. The SRP 
associates with the ribosome and targets nascent proteins 
to the endoplasmic reticulum via an N-terminal ‘signal’ 
or ‘leader’ sequence for membrane insertion or secre- 
tion into the extracellular milieu,!°!! with 7SL RNA 
acting as a scaffold upon which the six proteins of the 
SRP assemble./??.? Similar RNAs were later shown to be 
involved in protein export in bacteria and archaea.!!!-112 

7SL RNA was subsequently found to be required 
for the selective packaging of the RNA/DNA modify- 
ing enzymes APOBEC3G and 3F into retroviral par- 
ticles!!5-1!7 and to repress the translation of the tumor 
suppressor TP53.!? It is also a precursor of the Alu ele- 
ments in the human genome (Figure 8.3).!19-121 

7SK RNA is highly expressed in vertebrates and, like 
most RNAs, has a complex structure.!? Early evidence 
indicated that 7SK regulates transcription and transcrip- 
tion termination in a tissue- and species-specific man- 
ner, ?*.?4 but mechanistic insights would not emerge until 
the early 2000s. These studies showed that 7SK RNA 
acts as a trans-acting negative regulator, uncovered ser- 
endipitously in biochemical assays to identify factors that 
regulate RNA polymerase II, similar to the "general tran- 
scription factor" role identified for the small 6S RNA in 
bacteria!” (Chapter 9), but having additional roles in gene 
expression regulation in animals!26-134 (Chapter 13). 7SK 
is also present in invertebrates, and, despite the fact that there
is little primary sequence similarity, its identification was
possible due to its conserved secondary structural motifs 
and domains.!35.136 

In one of the earliest demonstrated regulatory
RNA roles, in 1980, Hugh Pelham, Robert Roeder
and colleagues discovered that the transcription factor
TFIIIA* not only binds the 5S rRNA gene! but also its
transcript, which results in a feedback loop that titrates
the transcription factor away from the gene, inhibiting
further transcription and stabilizing the transcript until
required for ribosome assembly.!*°-!4!


4 Blobel and colleagues were expecting only proteins to be components
of the SRP. The presence of RNA was revealed by a strong
UV absorbance signal at a wavelength typical of nucleic acids in the
SRP preparation, detected as a result of an incorrect setting on the
detector!


FIGURE 8.3 The structure of 7SL RNA, showing the part co-opted into Alu transposable elements in primates. SRP19 binding
sites are shown. (Reproduced with minor modifications from Itano et al.!! with permission from John Wiley and Sons.)

Y RNAs were identified in 1981 as components, like 
snRNAs, of autoantigens in systemic lupus patients.21-12 
There are four distinct and highly conserved Y RNAs (in 
humans ranging from ~80 to ~110nt),'* which are structural
components of the Ro autoantigen.!**-146 The Ro protein,
lack of which causes a lupus-like syndrome in mice,
appears to prevent autoimmunity by recognizing mis- 
folded RNAs, with Y RNAs regulating the process.!4-^? 
Y RNAs also occur in bacteria where they associate with
orthologs of mammalian Ro and are involved in rRNA
maturation and stress responses,??-? with a modular
structure that includes a domain that mimics tRNAs,?
indicating an ancient function in cellular RNA biology.


TFIIIA was shown by Aaron Klug and colleagues to interact with
RNA and DNA via repetitive domains stabilized by zinc, called ‘zinc
fingers’.!% Zinc finger transcription factors were later found to be the
largest class of transcription factors in plants and animals, comprising
3% of the genes in the human genome? (Chapter 16).

A non-coding RNA, termed 5S-OT, is transcribed from 5S rDNA
loci in eukaryotes and has been shown to regulate transcription of
5S rRNA in mammals. An antisense Alu element has inserted at the
5S-OT locus in monkeys, apes and humans and regulates alternative
splicing of other genes via Alu/anti-Alu pairing.'??

Vault RNAs, short 80-150nt RNAs transcribed by 
RNA polymerase III, so named because of their presence 
in large ovoid ribonucleoprotein particles in the cytoplasm 
of eukaryotic cells that resemble the arches of cathedral 
vaults, were discovered in 1986 by Nancy Kedersha and 
Leonard Rome.!5*155 Vault RNAs are considered essential 
for eukaryotic cell biology because of their high conserva- 
tion and near ubiquitous presence.!5 Their function is not 
well understood, but recent evidence indicates that they 
play a role in regulating autophagy,'*’ i.e., the degradation 
and recycling of cellular components in lysosomes,!% as 
well as apoptosis!/??/9? and signaling pathways involved in 
neuronal synapse formation and plasticity.!6! 

Other RNAs discovered in these years were the 
highly abundant viral RNAs in infected cells, some of 
which interact with autoantigens. The ~160nt VA RNAs
(virus-associated RNAs) present in cells infected with 
adenovirus'® (Figure 8.4) were in fact the first non-coding 
RNAs described after the ‘canonical’ RNAs (rRNA, 
tRNA and mRNA) and the first non-coding RNAs shown 
to be expressed from mammalian viruses, reported in 
the same year as the identification of the lac repressor.
VA RNAs were later shown (among other roles) to inhibit
protein kinase R (PKR) to curb the innate immune
response and enhance the translation of viral RNAs.!93-169


FIGURE 8.4 The discovery of low molecular weight virus-associated RNA (VA RNA): size chromatography of newly synthesized
(radiolabeled) RNA isolated from adenovirus-infected cells (open circles) and uninfected cells (closed circles) (UV
absorbance shown by triangles). (Reproduced from Reich et al.!% with permission from Elsevier.)

RNAs that are not highly conserved across large evolutionary
distances but have specific expression patterns
were also identified during the 1980s. Examples include
neuronal RNAs that are transported into dendrites, such as
B2, a ~180nt RNA transcribed by RNA polymerase III from
repeated sequences in mice, which shows higher expression
in some tumor cells and heat-shocked cells, 9-7 and BC1
(brain cytoplasmic RNA 1), a ~150nt RNA transcribed in
rats from repeated sequences derived from a tRNA,'?-1
with a human equivalent, BC200."9!7 Transgenic mice
lacking BC1 have no obvious developmental deficiency, but
display reduced exploration activity and increased anxiety,
a phenotype that is invisible in the cage but likely lethal in
the wild."*


* PKR is in fact variously inhibited or activated by trans-acting
RNAs.!?

Small RNAs are also required as primers for DNA 
replication!”%180 and for the maintenance of telomeres, 
shown by Elizabeth Blackburn, Carol Greider and Jack 
Szostak to be accomplished by retrotransposon-derived 
RNAs operating on repeat sequences to replicate chro- 
mosome ends.!52-159 

Small common RNAs such as snRNAs and snoRNAs 
were relatively easy to find. Others were identified later 
in genomic datasets by characteristic sequence motifs 
and secondary structures, using a growing suite of bioin- 
formatic tools!%-1% and databases such as Rfam!?*9 and 
RNAcentral.' On the other hand, lower copy number 
transcripts that are less conserved and expressed only in 
particular circumstances or cells were difficult to detect, 
a problem compounded by the lack of anticipation of the 
existence of cell-specific transcripts beyond mRNAs and 
their nuclear precursors (Chapters 12 and 13). 


CATALYTIC RNAs AND THE ANCIENT 
RNA WORLD HYPOTHESIS 


The participation of RNAs in a variety of cellular pro- 
cesses and the existence of many RNPs with different 
functions signaled that RNAs are versatile. As put by 
Crick in 1966, referring to the ability of RNA to form 
complex secondary structures: “tRNA looks like Nature's 
attempt to make RNA do the job of a protein."!?? 

In 1957, an influential symposium in Moscow on the 
origin of life speculated that RNA most likely preceded 
proteins in the origin of life, a view supported there by 
Brachet, Mirsky, Oparini and others.” In 1962, Alex 
Rich proposed that RNAs had a central role in the origin 
of life.??? In the late 1960s, Orgel, Woese and Crick also 
hypothesized that RNAs might have preceded proteins in 
a pre-cellular world, predicting that RNAs possessed the 
required enzymatic activities.!?7201.202 

Nonetheless, the existence of catalytic RNAs in extant 
organisms was completely unexpected.?% In 1982, Tom 
Cech and colleagues discovered that RNA can perform 
autocatalytic 'self-splicing' rearrangements, removing 


^ The existence of telomeres to protect chromosome ends was inferred 
by McClintock and Muller in the 1930s. Muller coined the term 
‘telomere’ from the Greek ‘telos’ (end) and ‘meros’ (part).!*!

i Oparin previously had a long running debate with Hermann Muller, 
with Oparin maintaining that life was the outcome of a step-wise 
process of pre-cellular evolution of membrane-bound polymolecular 
systems, whereas Muller argued that life started with the appearance 
of the first nucleic acid molecule.!^*

the intervening sequence of rRNA precursors by excision
and cyclization in the ciliate protozoan Tetrahymena. 
They named these catalytic RNAs “ribozymes”,? later 
categorized as ‘self-splicing group I introns’. 

Group II self-splicing introns were recognized in 
1983 in organelle genomes by François Michel and 
Bernard Dujon.?% They consist of a catalytically active 
intron RNA and an intron-encoded reverse transcrip- 
tase, enabling intron proliferation within genomes. 
Group II intron RNA catalyzes its own splicing via 
transesterification reactions that are the same as those 
of spliceosomal introns, yielding spliced exons and 
an excised intron lariat RNA.?96205 Thus, group II 
introns appear to be the ancestors of modern spli- 
ceosomal introns?0%-213 and likely entered the eukary- 
otic lineage through the bacterial ancestor of the 
mitochondrion.21%215 

Also in 1983, Sidney Altman, Norman Pace and col- 
leagues showed that RNA is the catalytic component of 
the bacterial RNase P complex, which produces mature 
tRNAs by cleaving a 5' end sequence in a process analogous
to splicing.” The RNA in RNase P is one of the
only two ribozymes found in all domains of life?" a 
closely related eukaryotic RNA was also described in the 
early 1980s?* and later shown to be the catalytic center 
of a complex involved in sequence-specific processing of 
mitochondrial and other RNAs,?? hence called RMRP 
(RNase MRP), although it is mainly located in the 
nucleolus and has other important regulatory functions 
(Chapter 13) (Figure 8.5).220.221 

The demonstration that RNA molecules are able to 
cleave and join themselves or other RNAs, and capable of 
the phosphodiester bond transfers needed for RNA syn- 
thesis, prompted Walter Gilbert to formalize the “RNA 
World" hypothesis.?? In this view, RNA molecules, not 
proteins, were the precursors of existing life, having per- 
formed the catalytic and information storage functions! in 
the pre-cellular world.* 

Since then self-cleaving ribozymes have been 
found in bacteria, protists, fungi, plants, nema- 
todes, arthropods, insects and vertebrates, including 


i This view is supported by many observations, including that an RNA 
polymerase ribozyme obtained by in vitro evolution can copy com- 
plex RNA templates, including itself, albeit at low fidelity.22* There 
are also plausible scenarios for the prebiotic synthesis of the pyrimi- 
dine and purine building blocks of RNA. 

RNA can also nucleate ‘liquid crystal’ phase-separated domains,
explored early on by Oparin (Chapter 2), who worked extensively on
the role of RNA as a polyanion in the formation of ‘coacervates’,2 a
property that would come to the fore as central to both modern cell 
biology and prebiotic evolution as RNAs would have been able to 
sequester organic molecules in a proto-cell (Chapter 16).

FIGURE 8.5 Secondary structures of six types of ribozymes. (Reproduced from Takagi et al.22 with permission of Oxford
University Press. The ribozyme or intron portion is printed in green. The substrate or exon portion is printed in black.) 


human,222227-234 one of which has been shown to meth- 
ylate other RNAs.2% More surprisingly for its impli- 
cations, B2 and Alu elements — “repetitive” sequences 
that occur in ~350,000 copies and over 1 million copies
in the mouse and human genomes respectively — have 
been shown to harbor self-cleaving ribozyme activ- 
ity that is induced upon stress and T-cell activation 
by binding to the Polycomb histone methyltransferase 
protein EZH2? (Chapter 16). 

It has been reasonably postulated that modified nucle- 
otides may have enhanced the early catalytic capacities 
of RNAs and facilitated their path to self-replication.?% 
Ribosomal RNAs, small nucleolar RNAs and spliceo- 
somal RNAs are all heavily modified, and RNA modi- 
fication has been widely deployed as a mechanism 
to introduce plasticity into RNA regulatory circuits 
(Chapter 17). 


THE CATALYTIC HEART OF 
SPLICING AND TRANSLATION 


In 1992, Harry Noller and colleagues showed that rRNA 
is not just a structural scaffold for the ribosome, as had 
been widely assumed, but harbors the central peptidyl 
transferase activity for protein chain extension in trans- 
lation, making the ribosome a complex and conserved 
ribozyme?7?235 (Figure 8.6). RNAs must therefore have 
pre-existed proteins. 

As noted above, RNA splicing is also an RNA-catalyzed
reaction, with sequence and mechanistic similarities
to group II self-splicing introns in bacteria, which are capable
of inserting into new locations by reversal of the splicing 
reaction.209.239 

Many small molecules, including antibiotics, target 
ribosomal and other RNAs: these include tetracyclines,


! U2 and U6 snRNAs interact to form the conserved structure of the 
catalytic triplex, coordinating two magnesium ions to form the active 
site of the spliceosome.??


FIGURE 8.6 The active site of the ribosome, the peptidyl transferase center, is located on the large ribosomal subunit within a 
highly conserved region of the ribosomal RNA (red). (a) Model of RNA structure. (b) detail. (Image courtesy of Marina Rodnina 
and Wolfgang Wintermeyer (Max Planck Institute for Biophysical Chemistry, Göttingen).)


aminoglycosides such as streptomycin and gentamicin,
chloramphenicol, carbomycin A, blasticidin S, puromy- 
cin and hygromycin B.2%2* They also include poisons 
such as ricin, an N-glycosidase RNA-modifying enzyme 
that depurinates a conserved loop of 28S rRNA and 
leads to irreversible arrest of protein synthesis.?*? Indeed, 
therapeutic targeting of diverse RNA types by small mol- 
ecules is an area of rapidly growing interest. 


THE DIGITAL AND ANALOG FACES OF RNA 


This period also saw the expansion of RNA structural biol- 
ogy, led by Noller, Eric Westhof, Tom Steitz, Robin Gutell, 
Jennifer Doudna and others, who showed that RNAs have 
extraordinarily complex structures, capable of binding 
proteins, as a consequence of being able to form hydro- 
gen bonds on all three faces — the Watson-Crick face, the 
Hoogsteen face (within the double-stranded groove) and 
the ribose, because of its 2' hydroxyl, which is lacking in 
DNA .2*^7^9 They also have exposed sequences that can 
base pair with other RNAs and DNA, through RNA:RNA 
duplexes, R-loops (RNA:DNA duplexes with displaced sin- 
gle stranded DNA) and RNA:DNA:DNA triplexes, which 
are common in eukaryotic chromatin?*-24 (Chapter 16). 
The structural versatility of RNA has been explored 
and exploited in vitro. In the 1990s, the groups of Larry 
Gold and Jack Szostak developed SELEX (‘Systematic
Evolution of Ligands by Exponential Enrichment’)
to evolve artificial RNAs that bind specific ligands 
(‘aptamers’) or have other activities,?2% speculating 
that the same will have occurred in vivo.2% 

In addition, “free” low molecular weight circular RNA 
molecules lacking protein-coding capacity but having 
"peculiar" secondary structures and ribozyme activity 
were discovered in the 1970s to infect and autonomously 
replicate in plant cells (baptized as ‘viroids’),?>+?°> 
and postulated to represent living fossils? subject to 
Darwinian evolution in a prebiotic world.” Other curi- 
ous virally associated RNAs were also described in the 
late 1970s and 1980s.255-260 


CANDLES IN THE DARK 


By 1985, it was becoming apparent that RNAs are multi- 
faceted molecules that have specific subcellular locations, 
form complex structures, interact with (many) proteins and 
perform a vast array of functions beyond protein synthe- 
sis, from gene regulation to acting as components of cel- 
lular complexes, catalytic molecules and antisense guides. 
These observations did little to challenge the assump- 
tion that the destination of genetic information is (nearly 
always) the production of a protein and were regarded as 
interesting but idiosyncratic additions to the tapestry of 
molecular biology, rather than the first indications of a 
wider role for RNA in cell and developmental biology.

In his 1986 article describing the RNA World hypothesis,
Gilbert proposed that, after the emergence of DNA
as the carrier of genetic information, RNA was then 
"relegated to the intermediate role that it has today — no 
longer the centre of the stage, displaced by DNA and the 
more effective protein enzymes”.22 Others concurred.?! 

Penman lamented in 1991: 


If genes just make proteins and our proteins are 
the same, then why are we so different? ... we 
have the bizarre proposal dominating biology 
that the incredibly complex living systems are 
described entirely by component proteins and 
their coding sequences. Where is the genetic 
information that executes the design of an organ- 
ism? We do not have to look far for a candidate. 
There is plenty of information in the more than 
95% of the genome that is devoid of open reading 
frames. These sequences, heavily transcribed in 
all cells, appear to have little, if anything, to do 
with making proteins.?


Glimpses of a Modern RNA World


The discovery of abundant small RNAs with functions 
beyond translation and the recognition of their target 
specificity by base-pairing prompted exploration of the 
regulatory potential of ‘antisense’ molecules.! In the late 
1970s, Paul Zamecnik (whose group discovered tRNA, 
Chapter 3) and colleagues demonstrated that binding 
of short synthetic oligonucleotides to complementary 
sequences in Rous sarcoma virus and human T-cell lym- 
photropic virus blocked the replication and translation of 
viral RNA and the oncogenic transformation of cells.?? 
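
The base-pairing logic behind such antisense oligonucleotides can be illustrated with a minimal Python sketch. It assumes perfect Watson-Crick complementarity only (no G-U wobble pairs, mismatches or secondary structure), and the function names and example sequences are hypothetical, chosen purely for illustration rather than taken from the studies cited.

```python
# Illustrative only: Watson-Crick pairing rules for RNA (G-U wobble pairs ignored).
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}


def antisense(rna: str) -> str:
    """Return the perfectly complementary (antisense) strand, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(rna.upper()))


def find_target_sites(target: str, oligo: str) -> list[int]:
    """Return 0-based start positions in `target` where `oligo` can pair perfectly."""
    site = antisense(oligo)  # the sense sequence that the oligo would bind
    target = target.upper()
    return [
        i
        for i in range(len(target) - len(site) + 1)
        if target[i:i + len(site)] == site
    ]
```

With the made-up sequences `target = "GGAUGCAA"` and `oligo = antisense("AUGC")`, the call `find_target_sites(target, oligo)` returns `[2]`, the single position at which the oligonucleotide is fully complementary to the target.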

Such findings suggested that short antisense RNAs 
might exist naturally in cells, not as common species 
involved in core processes but as (individually rarer) 
regulators of specific genes or transcripts that were difficult to
identify and characterize by the analytical techniques of
the time.


RIBOREGULATORS 


The clues were already there. Studies in the 1960s had 
revealed a number of small RNAs (sRNAs) of unknown
function in bacteria.^5? The existence of ‘antisense’ 
RNAs and ‘bidirectional transcription’ was first reported 
in 1972 in phage lambda, where it was proposed to con- 
trol expression of the lambda repressor.'” The subsequent 
sequencing of the genomes of bacteriophages and eukary- 
otic viruses showed that the occurrence of overlapping 
genes and transcripts was a general phenomenon.!!- ^ 
Conserved bacterial RNAs such as the 10Sa? and 10Sb* 
RNAs,”°?! as well as regulatory motifs and structures, 
had also been reported around that time, for example, in 


a One of these sRNAs was discovered in 1967,4 9 months after the 
identification of the lac repressor. It was initially dubbed 6S (also
known as SsrS) and later shown to be a structurally conserved mol- 
ecule that regulates RNA polymerase promoter use. One can only 
speculate what the impact on the conceptual framework of RNA and 
protein function in molecular biology might have been if this had 
come to light earlier. 

10Sa RNA is also known as tmRNA (a ‘transfer-messenger RNA’ 
with properties of a tRNA and an mRNA) or SsrA.P-" It was later 
shown to play a key role in the symbiosis between the bacterium 
Vibrio fischeri and squid? (Chapter 12). 

10Sb RNA was later found to be the bacterial homolog of the RNase
P ribozyme! (Chapter 8).


feedback mechanisms controlling rRNA and ribosomal
protein levels, akin to RNA structure-dependent regula- 
tory mechanisms identified in bacteriophages.??-*° 

In 1975, Stuart Heywood and colleagues demon- 
strated that short RNA sequences from chicken muscle 
ribonucleoprotein fractions could control translation 
of the mRNA encoding myosin. They called the RNAs 
“translation control RNAs" (tcRNAs) and proposed 
that tcRNAs act by binding their mRNA targets in a 
sequence-specific manner.?”28 In follow-up work a decade
later, Heywood demonstrated that one of these tcRNAs,
tcRNA102, recognizes a sequence in the 5'UTR of the
myosin mRNA 990

In the 1980s, a number of studies identified bacterial 
plasmid-encoded small ‘untranslatable’ antisense RNAs 
that formed stable secondary structures and regulated 
plasmid replication, plasmid incompatibility, transpo- 
sition and translation, among others.*!*> For example, 
mutation analysis revealed that the ~108nt antisense
transcript ‘RNA I’ blocked replication of the ColE1
plasmid (Chapter 6) by base pairing with the RNA that 
forms the replication primer?! — one of the first regulatory 
roles demonstrated for any RNA (see Chapter 8). Soon 
after, a ~70nt RNA transcribed from a promoter of the 
Tn10 transposon was shown to repress transposition by 
preventing translation of the transposase mRNA, rep- 
resenting the first example of transposon regulation by 
antisense RNAs.* A ~170nt antisense RNA (micRNA) 
expressed in E. coli was found to inhibit translation of 
OmpF mRNA, which encodes a major outer membrane 
protein.? Packaging of phage DNA during infection was 
found to be directed by the phage-encoded ~120nt phi29
RNA, as part of the DNA-packaging machine.?? 

Before the end of the decade, enough examples had 
accumulated to allow generalizations around the theme 
of antisense RNA control of gene expression and the 
potential of fine-tuning interactions in a way not read- 
ily achieved by proteins." Masayori Inouye speculated 
at the time that this "regulatory system may be a general 
regulatory phenomenon in E. coli and in other organisms,
including eukaryotes”38 and that “RNA species
may have additional roles in the regulation of various cellular
activities"?

In subsequent decades, it was shown that sRNAs regulate
many bacterial processes, including virulence, quorum
(community) sensing, symbiosis, stress responses,
the physiological transition from growth to stationary 
phase, other aspects of metabolism and environmental 
responses, bacteriophage packaging, DNA exchange, 
transcription and translation, among others.!8384243 
Thousands of bacterial sRNAs that regulate gene expression
at both transcriptional and post-transcriptional
levels? have now been described, ^^-^6 aided by new high- 
throughput RNA-protein interaction technologies. 

The following examples are illustrative. The E. coli 
DsrA RNA (Figure 9.1), which is induced at low tempera- 
tures, inhibits transcriptional silencing by the nucleoid- 
associated H-NS protein and stimulates translation of the 
stress sigma factor RpoS, both depending on association 
with the RNA-binding protein Hfq.?'^? Another ~109nt 
sRNA, oxyS, was found to repress translation of RpoS 


d As explained by Kai Papenfort and Jörg Vogel: “The late appre- 
ciation of regulatory RNA might be attributed to the fact that loci 
encoding such regulators were rarely selected in genetic screens for 
virulence factors, likely owing to a usually smaller gene size, miss- 
ing annotations in genome sequences, and typically subtle pheno- 
types, as compared to virulence-associated proteins."?? For example, 
the ~514nt RNAIII was originally described as the δ-hemolysin
mRNA of Staphylococcus aureus but subsequent molecular analysis
revealed that, in addition to expressing hemolysin from its 5' region,
RNAIII acts as an antisense regulator of virulence and surface protein
synthesis through its 3' region, a dual coding and regulatory
RNA 4041 


by interacting with Hfq and altering its activity, acting as 
a global regulator to activate or repress the expression of 
approximately 40 genes involved in stress responses.*+> 
In fact, many sRNAs, such as the Spot 42 sRNA that reg- 
ulates the galactose operon (Chapter 3), also require Hfq* 
for their stability and function.?^5* 

The common involvement of Hfq, which acts as a 
general cofactor for stabilizing small antisense RNAs, 
facilitating RNA-RNA interactions and gene expression 
control in many bacteria,60-6* including the regulation of 
utilization of the intestinal metabolite ethanolamine by a 
Hfq-dependent sRNA,® indicated the existence of broader 
RNA-regulated networks.*% This was also an early exam- 
ple of the use of a generic protein infrastructure to execute 
RNA-directed regulatory events, a theme that would later 
be writ large in eukaryotes (Chapters 12 and 16). 

Other RNAs that control global processes in bacteria 
were also discovered, such as the inducible CsrB and CsrC 
RNAs of E. coli, which bind (via conserved sequences and 
hairpin structures) and inhibit the RNA-binding protein 
CsrA, a translational regulator, by outcompeting mRNA 
targets. Homologs of this system have been implicated 
in the regulation of gluconeogenesis, biofilm formation and 
virulence factor expression in a variety of bacterial patho- 
gens,” and represent some of the first examples of mimicry, 


* Hfq was originally described in 1968 as an E. coli host factor
required for the synthesis of bacteriophage Qβ RNA and the replication
of the bacteriophage Qβ RNA genome.*


protein-sequestration or sponging by regulatory RNAs at a 
post-transcriptional level. 

In 2020, the late promoter of the Shiga toxin-encoding 
bacteriophage in enterohemorrhagic E. coli was found 
to produce an abundant regulatory RNA to silence the 
expression of the toxin during lysogeny (the Shiga toxins 
cause renal failure and neurological damage), which had 
been hiding in plain sight despite decades of research on 
Shiga toxin.® 

Synthetic riboregulators have been constructed for 
eukaryotic translational control. A common principle 
underlying the functions of these small regulatory RNAs 
is the ability to combine secondary structures that can 
bind proteins or small ligands (as exemplified by some 
ribozymes®”° and SELEX) with exposed nucleotide 
stretches that can recognize other RNAs or DNA in a 
sequence-specific manner. 


RIBOSWITCHES 


In 2002 and following years, Ron Breaker, Wade Winkler, 
Alexander Mironov, Evgeny Nudler and others showed that 
the ability of RNA to sense ligands, previously thought to
be the sole province of proteins, is widely used by bacteria
to connect the regulation of transcription and translation
to metabolic and environmental signals, including thiamin
(vitamin B1), riboflavin-5-phosphate (vitamin B2), biotin
(vitamin B7), cobalamin (vitamin B12), fluoride, various
amino acids, S-adenosyl methionine (SAM) and glucosamine-6-phosphate,
among many others, and even temperature
(RNA ‘thermometers’).71-7 These RNA ligand-sensing
modules have become known as 'riboswitches'/5-*! with 
more evident in genomic analyses,? and high-resolution 
studies revealing the molecular dynamics involved.*? 

For example, the SAM riboswitch (the 'S-box leader’) 
is a highly conserved RNA domain that responds to the 
coenzyme SAM with high affinity and specificity. In 
Bacillus subtilis, it occurs in the 5’ region of dozens of 
genes encoding proteins involved in methionine or cys- 
teine biosynthesis, where it allosterically regulates their 
expression at the level of transcription termination. When 
SAM is unbound to the RNA aptamer, the anti-termina- 
tor sequence sequesters the terminator, which is then 
unable to form, whereas when SAM is bound, the anti- 
terminator is sequestered and transcription is terminated 
(Figure 9.2).738485 
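
The switching logic of the B. subtilis S-box leader described above can be summarized in a deliberately simplified Python sketch. It treats the riboswitch as a two-state Boolean switch in which the anti-terminator and terminator hairpins are mutually exclusive; real riboswitches respond in a graded, concentration- and kinetics-dependent manner, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SBoxOutcome:
    antiterminator_forms: bool
    terminator_forms: bool
    transcription_continues: bool


def s_box_riboswitch(sam_bound: bool) -> SBoxOutcome:
    """Toy two-state model of a SAM riboswitch acting on transcription termination."""
    # Assumption: SAM binding to the aptamer sequesters the anti-terminator,
    # so the anti-terminator forms only when SAM is absent.
    antiterminator = not sam_bound
    # The two hairpins are modeled as mutually exclusive structures.
    terminator = not antiterminator
    # Transcription reads through into the downstream genes only without a terminator.
    return SBoxOutcome(antiterminator, terminator, transcription_continues=not terminator)
```

In this toy model `s_box_riboswitch(True).transcription_continues` is `False`, corresponding to termination when SAM is abundant, while `s_box_riboswitch(False).transcription_continues` is `True`.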
FIGURE 9.2 Structure and function of the S-adenosyl methionine (SAM) riboswitch in the 5' untranslated region of the mRNA
in the polycistronic met operon of Xanthomonas campestris, which encodes three enzymes for the biosynthesis of methionine, 
replaced here by the reporter gene gusA. Binding of SAM to the RNA aptamer in the riboswitch (boxed) causes an allosteric 
structural rearrangement that sequesters the Shine-Dalgarno sequence (purple) and AUG start codon (cyan) to inhibit translation. 
(Adapted from Tang et al. under license from Creative Commons.)

Thiamin riboswitches also occur in plants, fungi and
protists,$6-88 and have been shown, inter alia, to regulate RNA
splicing.? Riboswitches likely also exist in animals, 
although their repertoire is not so well explored, possi- 
bly because of the difficulty of their characterization in 
complex organisms. Artificial riboswitches have been 
constructed to respond to pH?? and light?! and to control 
RNA splicing? 

Ligand-induced allosteric changes in RNA structure 
are similar to those observed in proteins upon binding of 
small molecules, such as nucleotides (ATP, AMP, GTP, 
etc.) to sense energy status or transduce extracellular signals,
and even the lac repressor's recognition of lactose,
which causes a conformational change in the repressor so 
that it can no longer bind DNA to block transcription of 
lactose metabolizing enzymes. 

Riboswitches may have predated proteins and have 
been suggested to be the oldest mechanism for the regu- 
lation of gene expression? As Breaker speculated: “The 
characteristics of some riboswitches suggest they could 
be modern descendants of an ancient sensory and regula- 
tory system that likely functioned before the emergence of 
enzymes and genetic factors made of protein."? 


ANTISENSE RNAs AND COMPLEX 
TRANSCRIPTION IN EUKARYOTES 


The first evidence of regulatory antisense RNAs in 
eukaryotes was obtained in 1987, also by Zamecnik's 
group, who reported endogenous small (<30nt) RNA 
oligonucleotides in mammalian cells using radioactive 
labeling, proposing that they “may play a regulatory role 
in intracellular metabolism and may conceivably travel 
from one cell to another in a similar role”.% It was another 
13 years before the ubiquity and power of such regulatory 
‘microRNAs’ in eukaryotes started to be revealed and 
appreciated (Chapter 12). 

Nonetheless, based on the principles of the activity of 
antisense RNAs in bacteria, as well as experiments by 
Zamecnik's group with exogenous antisense oligonucleotides,
“anti-message” RNAs began to be used from 1984
as a tool for suppressing the expression of specific genes! 
in eukaryotes, including globin, at the level of transcrip- 
tion, translation and/or RNA stability!??-1?? before the dis- 
covery of natural antisense transcripts in eukaryotes.!05.104 

The use of synthetic antisense oligonucleotides was 
quickly adopted and is still widely employed to study gene 


f The use of antisense interactions between cDNAs and mRNAs (‘hybrid-arrested
translation’) was developed in the late 1970s for gene mapping
and identification.?6-95

function in a wide range of eukaryotes, including frogs,
insects, plants and mammalian cells, as specificity of 
inhibition is easy to achieve independently of any knowl- 
edge of the function of the gene under investigation.!% 

Antisense oligonucleotides also form a vital com- 
ponent of the toolkits for genetic engineering and gene 
therapy,* aided by artificial chemistries, such as peptide or 
phosphorothioate linkages and methylene bridges (‘locked
nucleic acids’), to increase the target affinity and half-life
of the molecules in vivo.!?7-!!! The use of synthetic nucleic 
acids has since been given new impetus by the discovery 
of another natural antisense RNA regulatory pathway, the 
small RNA-guided ‘CRISPR’ systems that have revolu- 
tionized genetic engineering (Chapter 12). 

At that time, however, despite the emerging examples 
of regulatory sRNAs in bacteria and the ability of anti- 
sense molecules to artificially modulate gene expression 
in eukaryotes, "the extent to which this novel form of 
regulation of gene expression is utilized in prokaryotes 
and eukaryotes ... [remains] ... to be established"? 

The first discoveries of natural regulatory RNAs in 
eukaryotic cells were serendipitous^ — a pattern repeated 
over the next ~15 years, until the genome projects
revealed the full extent of RNA expression (Chapter 13). 
Nevertheless, genetic screens and conventional gene cloning
and mapping approaches yielded, as an unexpected by-product,
many early observations that hinted at the existence of
longer (≥200nt) non-protein-coding RNAs in eukaryotes.

These early studies also revealed the existence of 
‘nested’ genes and non-coding transcripts in intensely 
studied genomic regions, such as developmental and 
cancer-related loci, as well as in studies using differential 
cDNA cloning and hybridization strategies! to identify 
transcripts from genes that are active or repressed in spe- 
cific tissues and/or developmental stages. 


£ The first RNA therapeutics company, Isis, now Ionis, was established 
in 1989 by Stanley Crooke.!% As of 2021, eight antisense oligonucle- 
otide drugs had been approved for commercial use.!°° 

One of the relevant discoveries during this period was the HIV TAR 
(trans-activating response element), an RNA stem-loop structure 
located at the 5’ ends of nascent HIV-1 transcripts, which was pro- 
posed to be a “novel type of regulatory element" for transcriptional 
activation.!!5-115 In addition to the viral regulatory RNAs mentioned 
in the previous chapter, several other non-coding RNAs were sub- 
sequently characterized from DNA and RNA viruses that infect 
eukaryotic cells. These include highly abundant small RNAs and 
lncRNAs discovered in the late 1970s and 1980s, such as the EBV-encoded
RNAs found to have different roles including recruitment of
transcription factors to control expression''* (Chapter 16) and the 2.7 
kb repeat-derived RNA that comprises ~20% of the early transcrip- 
tion from the human cytomegalovirus Beta2.7 gene!" and whose 
function only started to be identified decades later.!!* 

i Subtractive cDNA hybridization and differential display." 



In 1986, Steven Henikoff and colleagues reported 
the first case of a "gene within a gene" in Drosophila, 
showing that a pupal cuticle protein is encoded within 
the intron of an unrelated gene, on the opposite strand 
and independently expressed. They described their find- 
ings as “an unambiguous exception to the classical linear 
model of gene organization" and, considering the possi- 
ble commonality of genes nested within large introns and 
extended loci, remarked that “it is interesting to consider 
the genetic complexity that could result”.!20 

In the same year, Trevor Williams and Mike Fried 
showed that a region of the mouse genome encodes two 
RNAs that are transcribed in opposite direction and over- 
lap at their 3' ends, contemplating the implications in the 
light of the findings of experimentally introduced anti- 
sense RNAs inhibiting gene activity.!! 

Similarly, in the same issue of Nature, Charlotte 
Spencer and colleagues reported that a transcript of 
unknown function overlaps that of the dopa decarboxyl- 
ase (Ddc) gene on the opposite strand in Drosophila.!?? 
Given that the transcripts showed differences in tempo- 
ral and spatial expression, they proposed that the anti- 
sense transcript could have regulatory function based 
on either RNA-RNA base pairing or via transcriptional 
interference and that "such arrangements in eukaryotes 
may be more common than previously supposed", 22.7? a 
prediction that was confirmed 20 years later when high- 
throughput transcriptome analyses were undertaken in 
the wake of the genome projects!2*-126 (Chapter 13). 

Also in 1986, Alain Nepveu and Kenneth Marcu showed 
that the protein-coding and opposite complementary strands 
of the c-Myc locus in mice are transcribed and regulated 
independently,'?’ confirmed the following year by Gail 
Sonenshein and colleagues,"? suggesting a role of the anti- 
sense RNAs in c-Myc processing or transcriptional inter- 
ference."7 The TP53 tumor suppressor locus was shown in 
1989 to also express a long antisense RNA, “inRNA!, specu- 
lated to be involved in the maturation of p53 mRNAs."? 

Other examples in different organisms followed. An 
antisense RNA expressed in the silk moth Bombyx mori 
was found to display extensive complementarity to the 
chorion gene Hcb./2 and to be co-expressed in follicular 
cells during development.!% An RNA antisense (aHIF) 
to the 3'UTR of the human Hypoxia-inducible factor-1 
alpha (HIF1α) mRNA was found to be co-expressed
in cancer and hypoxia.!*! The expression of other anti- 
sense RNAs was found to negatively correlate with that 
of their complementary protein-coding mRNAs, such 
as the intronic antisense RNA from the human eIF2A
locus,!*? transcripts antisense to the chicken alpha-I collagen
gene? and transcripts antisense to the EB4 locus
in the slime mold Dictyostelium, whose functional analysis
suggested a role in regulating the stability of EB4
mRNA. 

In some cases, such as the tcRNA identified by 
Heywood, antisense RNAs had only limited sequence 
complementarity to their targets, suggesting the poten- 
tial existence of ‘trans-acting’ RNAs that originated from 
different loci.31% Other cases involved large regions of 
overlap between antisense RNA and mRNA, sometimes 
spanning most of the length of the transcription units.!36 

A large number of similar but disparate observations 
followed at the end of the decade, including demonstra- 
tions of other nested genes and sense-antisense pairs in 
plants, insects, birds and mammals,?6-? including in 
well-studied loci such as the globin clusters, which also 
showed evidence of transcription of non-coding regions 
encompassing enhancers and cis-regulatory elements,1%1% 
as did loci exhibiting parental imprinting (see below). 

Some of these early reports considered the conceptual 
and practical implications of the interleaved organization 
of genes and transcripts, challenging orthodox concepts, 
especially that the exons of protein-coding transcripts are 
the only biologically relevant portions of genes. As put by 
Adelman and colleagues in 1987: 


this situation may be a significant form of molecu- 
lar evolution. By using both strands of the same 
DNA, the information content (regulatory and/ 
or structural) of a particular genetic segment 
becomes amplified, adding a new complexity to 
the concept of a eukaryotic gene.!% 


LONG UNTRANSLATED RNAs 


In addition to antisense RNAs, a number of other “uncon- 
ventional” RNAs were found to be transcribed from 
“intergenic” regions of eukaryotic genomes and associ- 
ated with genetic effects, notably in the well-studied reg- 
ulatory regions of the bithorax complex studied by Lewis 
in Drosophila.'46 

As noted in Chapter 5, David Hogness, Michael Akam 
and colleagues discovered in 1985 that only one of the 
five mapped mutations (‘pseudoalleles’) in the bitho- 
rax complex corresponded to a protein-coding gene 
(Ultrabithorax or Ubx), while the others are derived from 
a much larger region containing regulatory elements. 
The pseudoallelic mutations were located in introns or 
in the upstream bxd region: the latter was found to be
transcribed into a ~27 kb RNA that has a number of large
introns and is subjected to differential splicing to pro-
duce various smaller (~1.2 kb) polyadenylated non-coding
RNAs, none of which has protein-coding potential.!47!48
The expression of these transcripts was also shown to be
highly regulated during embryogenesis, in a pattern that
is partially reflective of Ubx.!7.14

| That is, transcribed from genomic sequences between annotated
protein-coding genes, a description reflecting the deep bias that
genes = proteins.

Moreover, while the extended BX-C cluster contained 
three protein-coding genes (Ubx, abd-A and Abd-B), it 
produced at least seven distinct RNAs that are co-linearly 
transcribed during development!*”15! (Figure 9.3). Other 
non-protein-coding transcripts were also reported in the 
nearby iab-4 locus.15015 As their relevance was unclear, 
it was mooted that these transcripts are “functionless”!?
or “might function in cis by some unprecedented mecha-
nism”.!5! Some suggested that transcription of these loci
is a passive by-product of the recruitment of transcription 
factors to enhancer sites that act distally by looping to con- 
tact promoters of protein-coding genes, or that it is sim- 
ply the act of transcription (not the transcribed RNA) that 
is relevant by remodeling and/or exposing chromatin and 
underlying DNA sequences to transcription factors.153154 
Many (presumed) cis-regulatory elements encompassing 
different developmental ‘enhancers’ and ‘response ele-
ments’ for the epigenetic regulators Polycomb and
Trithorax!5%1% have been characterized in the loci that 
express these RNAs," but their mode of action is only 
now being elucidated, especially in the context of RNA- 
directed chromatin modifications that control the expres- 
sion of the clusters (Chapters 14 and 16). 

Another early example is the 93D locus, one of the 
largest of the genomic regions in Drosophila that ‘puff’ 
(i.e., become transcriptionally active) after heat shock. In 
the early 1980s, the groups of Subhash Lakhotia and Mary 
Lou Pardue found that the 93D locus does not specify a 
protein, but rather a set of rapidly evolving non-coding 
transcripts (although an intron is highly conserved!*), 
known as hsr omega RNAs: long repeat-containing tran-
scripts with at least three different isoforms of ~1-10 kb
that are differentially expressed in different tissues and 
developmental stages.15?-164 

Transcription of repeat-containing spliced and polyad- 
enylated long non-coding RNAs was also reported in loci 
involved in immunoglobulin class switching and recom- 
bination in the 1980s, with evidence that such transcripts 
are associated with ‘enhancer’ action and alteration of
chromatin architecture, including V(D)J recombination
at antigen receptor loci,!6-1% whose roles in these pro-
cesses are now being understood.1%-172

FIGURE 9.3 Map of the bithorax complex in Drosophila, showing the coding and non-coding transcripts produced from each
locus (the protein-coding genes Ubx, abd-A and Abd-B, and the regulatory genes bxd, iab-4 and iab-8) and their expression in
different segments of the fly. The non-coding RNAs expressed from iab-4 to iab-8 are also regulated by microRNAs (Chapter 12).
(Reproduced from Garaulet and Lai!'% with permission from Elsevier.)

Many long ‘nontranslatable’ antisense, sense and 
intergenic RNAs from higher eukaryotes were also 
cloned during the 1980s and 1990s. One of the first 
was an interspersed maternally derived 3.7 kb non-
coding RNA called ISp1, characterized by Britten and
Davidson's group in 1988, studying the sea urchin
Strongylocentrotus purpuratus. This transcript, whose
function is unknown, is polyadenylated and apparently
processed into shorter (400-600 nt) RNAs stored in the
cytoplasm of sea urchin eggs.!”*

Large ‘transcription units’ of unknown function that 
appeared to lack protein-coding potential were being 
reported in humans as early as 1985, notably from the 
PVT-1 (plasmacytoma variant translocation) locus that 
activates expression of the MYC oncogene and is heavily 
implicated in cancers through amplifications, retroviral 
insertions and translocations. The PVT-1 locus spans at
least 200 kb and expresses large multi-exonic non-coding 
RNAs that initiate 57 kb downstream of MYC."+181 

Other cancer-associated long non-coding RNAs
(lncRNAs) identified in following years, such as BIC,182-185
TP53TG1 (“TP53 target gene 1”)56/57 and DD3 (Differential
Display Code 3, also known as PCA3, prostate cancer anti- 
gen 3),555? turned out, like PVT1, to be, at least in part, 
microRNA precursors (Chapter 12), as did, for example, the 
developmentally regulated and highly conserved 7H4 RNAs, 
found by subtractive hybridization to be highly enriched in 
synaptic nuclei of rat skeletal neuromuscular junctions.!?.9? 

However, the first mammalian long non-coding RNA 
to be well recognized was H19. It was cloned in 1990 by 
Shirley Tilghman and colleagues by differential hybrid- 
ization, and corresponded to a transcript that was origi- 
nally identified by the same group in 1984 as an abundant 
RNA in a screen of a mouse fetal liver cDNA library 
(named after the cDNA clone designated pH19).?? It was 
also found to be expressed in rat skeletal muscle (where 
it was called ASM).?^ H19 is transcribed by RNA poly- 
merase II, spliced and polyadenylated, with a number of 
short open reading frames that are not conserved between 
mouse and human.* It also did not seem to be associated 
with ribosomes. H19 was, therefore, proposed to represent
an “unusual gene” whose product differs from a “classical
mRNA” in that it may act as an RNA, not an intermedi-
ate to a protein.!% It was subsequently found to be part of
an imprinted locus that encompasses the Igf2 (insulin-like
growth factor) gene, to have tumor suppressor capac-
ity/?92?! and to be lethal when ectopically expressed.??

* Later proteomic analysis indicated that H19 produces a short protein
in humans but not mice, suggesting that it was either gained or lost
in one or other lineage.

In the following year, Carolyn Brown, Hunt Willard 
and colleagues cloned an RNA expressed exclusively 
from the ‘X-inactivation center’ in the inactive X chro-
mosome in females. This alternatively spliced transcript
was a candidate for the factor responsible for dosage
compensation and was named Xist (“X-inactive specific
transcript”)? They found no evidence of an encoded
protein, and characterized human XIST! as a 17-kb RNA 
containing conserved tandem repeats, localized within 
the nucleus and coating the inactive X chromosome in 
female cells in a position “indistinguishable from the X 
inactivation-associated Barr body", leading them to pro- 
pose that Xist functions as a “structural RNA”.204 

Xist functions partially by recruiting chromatin- 
repressive complexes to promote heterochromatin forma- 
tion and transcriptional silencing of the chromosome”? 
(Figure 9.4), although how it selects just one of the two 
X-chromosomes" is not fully understood. While thought 
of as a special case at the time, Xist has become emblem- 
atic of the extraordinary complexity of non-coding RNA 
control of chromatin architecture including the nucleation 
of phase-separated domains (Chapter 16). It emerged later 
that the Xist locus originated by the fusion of a ‘pseudoge- 
nized' protein-coding gene with a set of transposable ele- 
ments that are essential to its function.?!2213 And it does 
explain how the heterochromatic Barr body described by 
Ohno (Chapters 4 and 7) is formed. 

Analogous non-coding RNAs balancing X-chromosome 
dosage in Drosophila, roX1 and roX2 (RNA on the X chro- 
mosome), were identified by use of enhancer traps and male- 
specific hybridization a few years later by Victoria Meller, 
Richard Axel, Mitzi Kuroda, Ron Davis, Richard Kelley, 
Asifa Akhtar and colleagues. The roX RNAs (whose activity 
is modulated by alternative splicing) act not to repress one of 
the two X-chromosomes in females but to globally upregu- 
late gene expression from the single X chromosome in males 
via tandem stem-loop structures that bind effector proteins 
to remodel chromatin and compartmentalize the X chromo-
some (also involving phase separation, Chapter 16).24-22?

By the end of the 1990s, dozens of lncRNAs with reg-
ulatory functions had been identified in a wide variety of
eukaryotes.22322 Early examples that hinted at the diversity
of these non-coding RNAs included: ‘meiRNA’ in fission
yeast, essential for the pairing of homologous chromosomes
in meiosis,22522 involving phase separation?" (Chapter
16); the dutA RNA in Dictyostelium, induced in develop-
ment during mold aggregation;7??*? an “unusual family”
of 1.8-2.4 kb transcripts in the malarial protozoan parasite
Plasmodium falciparum involved in the expression or rear-
rangements of virulence genes;?! the bifunctional enod40
RNA in lucerne, expressed during nodule organogenesis,
wherein the RNA structures are more highly conserved than
the encoded peptides;??? the pseudogene-derived anti-
sense transcript? pseudoNOS (or antiNOS) suppressing the
expression of the cognate nitric oxide synthase (NOS) gene
in neurons of the snail Lymnaea stagnalis;?*6?" Xlsirt
RNAs in frogs, “interspersed repeat transcripts” localized
and playing structural roles in the vegetal pole cytoskeleton
of Xenopus oocytes;?*?? the ‘yellow crescent RNA’ in the
ascidian Styela clava, a maternal transcript localized in the
zygotic myoplasm;”*°.*4! the mammalian non-coding multi-
exonic alternatively spliced ‘growth arrest-specific’ gas5
gene that hosts several snoRNAs (Chapter 8)/??^^ and is
also active as a mitochondrially localized long non-coding
RNA; and SRA, a highly conserved steroid receptor
coactivator (Figure 9.5), which was found accidentally in a
protein-binding screen and later described as being “differ-
ent from eukaryotic transcriptional coactivators in its ability
to function as an RNA transcript to selectively regulate the
activity of a family of transcriptional activators”.24624 Such
screens also detected many other RNAs that act as tran-
scriptional activators “when tethered to DNA”.2502»

! The gene naming convention uses all capitals for human genes, and
only the first letter in uppercase (e.g., Xist) in mouse.

m There are differences between rodents and primates, associated
with distinctions in early development. The paternal allele of Xist is
silenced in the trophectoderm (placenta) in mice, but not in humans.
Seemingly random inactivation of parental alleles occurs in human
trophectoderm and in both human and mouse embryos.?!92!!

? The first report of an antisense transcript from a pseudogene was that
from human topoisomerase I by Bing-Sen Zhou and colleagues in
1992.255

FIGURE 9.4 Localization of Xist on the transcriptionally inactive condensed X chromosome in interphase female human cells
(the Barr body, with intense DAPI DNA staining), which is accompanied by repressive histone modifications such as methylation
of H4K20 and H3K27, and H2AK119 ubiquitination (Chapter 14). (Image courtesy of Jeanne Lawrence,?%20 UMass Chan School
of Medicine).

FIGURE 9.5 The structure and evolutionary conservation from fish to mammals of the steroid receptor coactivator RNA (SRA).
(Reproduced from Novikova et al.?? with permission from Oxford University Press.)

As in Drosophila, several lncRNAs in vertebrates
identified during the 1990s were found to originate from
developmental loci, often showing coordinated expression
and functional relationships with their associated protein-
coding genes. These included RNAs antisense to homeo-
box-containing genes, such as HoxA11,2325 HoxD3,55
and Dlx1 and Dlx6.259257 Xist was also found to be over-
lapped by a gene specifying a 40 kb unstable antisense
lncRNA, Tsix, which was identified by RNA fluorescence
in situ hybridization and negatively regulates Xist expres-
sion during the early steps of X inactivation,?5825 an early
indication of the highly intricate regulatory networks
involving lncRNAs (Chapter 16).

From the end of the 1990s, it emerged that the imprinted 
Igf2/H19 locus and many other imprinted loci differentially 
express overlapping sense and antisense transcripts, 95.260264 
including the Igf2 receptor (Igf2r) locus, where an astound-
ingly long (~108 kb) antisense RNA was serendipitously
found to be transcribed from a promoter located in an
intron of the Igf2r gene.?® In contrast to the protein-cod-
ing gene Igf2r, which is expressed from the maternal allele,
the non-coding RNA, termed Air (Antisense Igf2r RNA), 
is expressed only from the paternal allele.265-267 

Similar phenomena were observed at other loci,295-270 
including Xist,?”! Meg3?” (also known as Gtl2, first iso-
lated by a gene trap approach in mice???) and the Kcnq1
locus. In the latter, transcription of a 91 kb lncRNA
named KvLQT1-AS (KvLQT1 antisense, later known as
Kcnq1ot1), like Air, initiates in an intron of the pater-
nal allele of the protein-coding Kcnq1 gene from a ‘CpG
island’ (a region of high GC dinucleotide content that is
a target for DNA methylation, normally associated with 
gene repression; Chapter 14) called the imprinting con- 
trol region (IC2)?>?”? (Figure 9.6). Given their reciprocal 
expression, these antisense RNAs were proposed to be 
involved in the silencing of the associated protein-coding 
genes,?*? demonstrated for Air in 2002.28! 

Both Air and Kcnq1ot1 RNAs were later shown by
the groups of Peter Fraser and Chandrasekhar Kanduri,
respectively, to silence transcription by binding to and
targeting the histone methylase G9a and DNA methyl-
transferases to chromatin to alter the epigenetic state of
the locus?82.283 (Chapters 14 and 16).

FIGURE 9.6 A composite figure showing (a) the genomic arrangement of the Kcnq1 imprinted locus in mouse with (b) the exon-
intron structure of the Kcnq1 gene and the antisense Kcnq1ot1 transcript initiated from its 6th intron. (Adapted from Kanduri?”
(a) and Pandey et al." (b) with permission from Elsevier.)


UTR-DERIVED RNAs 


The protein-coding portion (the open reading frame) 
of mRNAs is flanked by ‘upstream’ and ‘downstream’ 
untranslated regulatory regions, referred to as 5'UTRs 
and 3’UTRs, respectively. 3’UTRs have increased in 
size with increasing morphological complexity during 
animal evolution, especially in vertebrates, where they 
usually occupy as much or more of the mRNA as the 
coding sequence in mammals and are often highly con- 
served.285-287 3’UTRs contain modules that bind regula- 
tory proteins and small RNAs (Chapter 12) to control the 
translation, localization and stability of the mRNA.255259

In 1993, Helen Blau and colleagues discovered that the 
3’UTRs of three muscle associated genes (troponin I, tropo- 
myosin and a-cardiac actin) could inhibit cell division and 
suppress malignancy in a myogenic cell line independently 
(i.e., in the absence) of the normally associated protein-
coding sequence.??? Other trans-acting 3’UTR-derived
RNAs with similar properties were reported from the genes 
encoding ribonucleotide reductase?” and prohibitin.??? It 
was also shown that the loss of oogenesis caused by the 
lack of the Drosophila gene oskar can be rescued by its 
3’UTR alone, indicating that this RNA “acts as a scaffold 
or regulatory RNA essential for oocyte development”.2% 

Later studies showed that independent expression of 
3'UTR sequences is widespread (Chapter 13), occurring 
in as many as half of all mammalian genes?5?29599 as well 
as commonly in plants.*°° These “UTR-associated RNAs" 
(uaRNAs) or “downstream of genes" (DoGs) are, at least 
in some cases, nuclear-localized and induce differentia- 
tion separately from their usually associated protein-coding 
sequences (Chapter 13).290-295297301303 In the testis, for exam-
ple, the coding sequences of the Myadm gene are expressed
in the cytoplasm of the interstitial cells, whereas the 3'UTR 
is not expressed in these cells but highly expressed in the 
nuclei of germ and Sertoli cells. This phenomenon is par- 
ticularly pronounced in the brain??? where, for example, 
the 3'UTR of the Klhl31 gene but not the coding region is
highly expressed in the cerebellum and the hippocampus???
(Figure 9.7), and cytoplasmic cleavage of the IMPA1 3'UTR
is necessary to maintain axon integrity.??? 

That a covalently linked RNA sequence regulates mRNA
activity in cis but also acts independently in trans is an
astounding observation, whose regulatory and evolutionary
logic (its biological raison d’être) is yet to be satisfactorily
explained, but which is emblematic of the complexity and
mysteries of the emerging world of RNA regulation. It also
presaged later findings that non-coding RNAs can act as
‘decoys’ or ‘sponges’ for bacterial*”*? and eukaryotic small
RNAs?! and RNA-binding proteins (Chapters 12, 13 and 16).
It is also clear, and in retrospect unsurprising, that the terms
messenger and regulatory RNA are not mutually exclusive
and that individual RNAs can have multiple functions
(Chapter 13).

? Exons specifying 5'UTRs have expanded in humans and are highly
alternatively spliced.?*


FIRST EXAMPLES OF SMALL 
REGULATORY RNAs IN ANIMALS 


Also in 1993, two articles published by the groups of 
Gary Ruvkun and Victor Ambros described a small RNA 
that played a role in developmental regulation in the 
nematode worm Caenorhabditis elegans.*°*>" Previous
genetic screens by their groups had shown that the prod-
uct of the lin-4 gene regulates the expression of lin-14, a
heterochronic gene encoding a nuclear protein involved 
in the temporal control of post-embryonic development, 
by a mechanism targeting the 3'UTR of lin-14. 

Both groups were expecting a regulatory protein, 06307
but the lin-4 locus mapped to the intron of a long spliced
non-coding RNA. The Ambros group found that the
primary lin-4 transcript was processed into two over-
lapping RNAs of 61 nt and 22 nt in length. Similar to
the elucidation of the roles of spliceosomal snRNAs and
snoRNAs, both groups found that the small RNAs pro-
duced from the lin-4 locus had partial complementarity to
a number of sequences in the 3'UTR of the lin-14 mRNA
(Figure 9.8). Curiously, while noticing that lin-14 protein 
levels are reduced in development without changes in the 
transcript abundance, they proposed that these "small 
temporal RNAs" formed multiple RNA duplexes that 
inhibited the translation of lin-14 mRNA. This inhibi- 
tion depended on the (partial) base pair complementar- 
ity, which was conserved in the homologous sequences 
of the related species C. briggsae.304305.308.30 Thus, it was 
speculated that there may be a “novel kind of antisense 
translational control mechanism" and that “lin-4 may 
represent a class of developmental regulatory genes that 
encode small antisense RNA products”.304 

They were not to realize how prophetic these words 
would be (Chapter 12). 
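
The notion of partial base-pair complementarity described above can be illustrated with a minimal sketch. It is not taken from the original studies: the function name and both toy sequences are hypothetical, only Watson-Crick pairs are counted (G-U wobble pairs are ignored), and the two strands are assumed to be the same length and aligned without gaps.

```python
# Watson-Crick pairs only; G-U wobble pairing is deliberately ignored here.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def paired_fraction(small_rna: str, target_site: str) -> float:
    """Fraction of positions that base-pair when a small RNA (written 5'->3')
    is aligned antiparallel to an equal-length target site (written 5'->3')."""
    if not small_rna or len(small_rna) != len(target_site):
        raise ValueError("sequences must be non-empty and of equal length")
    target_3_to_5 = target_site[::-1]  # antiparallel alignment
    matches = sum((a, b) in PAIRS for a, b in zip(small_rna, target_3_to_5))
    return matches / len(small_rna)

# Hypothetical toy sequences, not the real lin-4 or lin-14 sequences.
toy_small_rna = "AUGGCUAC"
perfect_site = "GUAGCCAU"   # exact reverse complement of toy_small_rna
partial_site = "GUUGCGAU"   # same site with two mismatched positions

full = paired_fraction(toy_small_rna, perfect_site)
partial = paired_fraction(toy_small_rna, partial_site)
```

A real analysis would also allow bulges and loops between the strands, which this position-by-position comparison cannot represent.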


CURIOSITIES OR EMISSARIES? 


The general significance of this finding was, once again,
not recognized at the time. tRNAs were for a long time
considered the “smallest biologically active nucleic
acids known”.??? Given the “incredible” small size of
these RNAs and the lack of obvious homologs outside
of worms, even the groups of Ruvkun and Ambros
saw them as a “curiosity” of worms, comparable to the
“gene regulatory vignettes” of small bacterial regula-
tory RNAs and the few known eukaryotic non-coding
RNAs.304,306,307

FIGURE 9.7 Top panel: Expression of Myadm coding sequences in the interstitial cells of the developing mouse testis and the
3'UTR in the nuclei of Sertoli and germ cells in the testis cords. Bottom panel: Expression of the Klhl31 3'UTR but not coding
sequences in the cerebellum and hippocampus, with close-up of the hippocampus showing especially strong expression in the
dentate gyrus. (Reproduced from Mercer et al.2% with permission from Oxford University Press.)

Not only was it novel that such tiny RNAs could 
have regulatory properties, it was also surprising that 
they originated from an intron and targeted non-coding 
regions in mRNAs, at a time when the regulatory rele- 
vance of 3’UTRs was still being established.?!! However, 
this did not disturb the prevailing conceptual framework. 
According to Ambros, “there was no theoretical need
to explain existing phenomena in terms of new mecha-
nisms or new classes of molecules. Transcription factors-
mediated regulation of cell fate was a successful model to
account for developmental biology”.307 

A 1994 editorial in Science highlighted these emerg- 
ing findings, the various interpretations and Mattick's 
hypothesis, remarking on the existence of “too many 
cases of odd RNAs" and speculating that “there might 
be a whole family of regulatory RNAs"? Similarly, an 
editorial in Nature posed the question: “Are these RNAs 
all grotesque deviants, one-of-a-kind aberrations, like 
characters in a Fellini film?" and admonished “But pay 
close attention to them. [They may] instead have been 
the first emissaries from an unexplored and vast RNA 
world"?! 
FIGURE 9.8 (a) Northern blot showing the small ~22 nt RNA (lin-4S) produced by the lin-4 gene and its precursor (lin-4L) in
wildtype C. elegans, absent in a deletion mutant. (b) The sequences of lin-4S and lin-4L, the latter showing the predicted secondary
stem-loop structure. Sequences complementary to the lin-14 3'UTR are bold. (c) The complementarity between lin-4 and seven
copies of a repeated element in the 3'UTR of lin-14 RNA that is conserved in C. elegans and C. briggsae. (Reproduced from Lee
et al. with permission from Elsevier.)
GENOME MAPPING


The prelude to genome sequencing was genome mapping, 
physically for microbial genomes, which generally range 
from 400kb to ~10 Mb, and, initially, genetically for ani- 
mal and plant genomes, which are orders of magnitude 
larger (Chapter 7). 

Physical mapping was performed by cleavage of 
genomic DNA with restriction endonucleases with rare 
recognition sites and electrophoretic size separation of 
the resulting fragments, which were then ordered into 
linear or circular maps by partial or sequential diges- 
tion and DNA cross-hybridization. Genetic markers were 
integrated into these maps in well-studied bacterial spe- 
cies, such as E. coli (4.7 Mb)! and Pseudomonas aerugi- 
nosa (5.9 Mb) as well as in the brewing and baking yeast 
Saccharomyces cerevisiae, which also has a relatively 
small genome enabling relatively accurate estimation of 
its size.? Maps were also developed for other well-studied
fungi, notably the ‘fission’ yeast Schizosaccharomyces 
pombe (which is also used in brewing?) and Neurospora 
crassa. 

Genetic mapping of large genomes based on the fre- 
quency of co-inheritance of linked markers was pio- 
neered in Drosophila in the early part of the 20th century 
(Chapter 2) and well advanced by its end. Genome maps 
based on linkage analysis were also constructed for other 
widely studied species, such as maize, rodents, cattle and 
the model flowering plant Arabidopsis thaliana, in which 
mutants were first described in 1873 (see^) and which 
gained wide currency from the 1950s and 1960s, espe- 
cially when it became obvious that, by plant standards, it 
has an unusually compact genome. 
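
As a rough illustration of how co-inheritance frequencies become map distances, the following sketch converts a recombination fraction into centimorgans. It is not from the source: the function names and the offspring counts are hypothetical, and the corrected distance uses Haldane's mapping function, which assumes that crossovers occur independently of one another.

```python
import math

def recombination_fraction(recombinant: int, total: int) -> float:
    """Fraction of scored offspring in which two markers were separated."""
    if total <= 0 or not 0 <= recombinant <= total:
        raise ValueError("need total > 0 and 0 <= recombinant <= total")
    return recombinant / total

def haldane_cm(r: float) -> float:
    """Map distance in centimorgans under Haldane's function
    (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

# Hypothetical cross: 120 recombinant offspring out of 1,000 scored.
r = recombination_fraction(120, 1000)
naive_cm = 100.0 * r          # simple rule: 1% recombination is about 1 cM
corrected_cm = haldane_cm(r)  # allows for undetected double crossovers
```

The corrected value is always at least as large as the naive one, because double crossovers between two markers restore the parental combination and go uncounted.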

However, physical mapping of genomes of complex 
organisms at high resolution was a monumental task 
beyond the capability of any laboratory or consortium 
before the 1980s. The restriction fragment pattern was 
far too complex for any individual segment to be resolved 
and identified — a blur on electrophoretic display — until 
these genomes could be partitioned by cloning. 


^ Pombe is the Swahili word for beer. 
GENETICS AT GENOME SCALE


In the 1970s, Hogness, Welcome Bender and colleagues 
constructed libraries of randomly cloned large inserts that 
encompassed the entire Drosophila genome (Chapter 6). 
This allowed physical restriction enzyme maps to be 
developed for individual segments and genomes to be vir- 
tually assembled by “chromosomal walking” across over- 
lapping segments. It also allowed the screening of these 
libraries for specific sequences by ‘colony hybridization’ 
and the first ‘positional cloning’ of a gene, Ultrabithorax, 
by mapping a disruptive inversion, an approach that was 
then extended to many other alleles and genes in which 
mutants had arisen by chromosomal breakages or trans- 
poson insertions.>~° 

In 1980, Christiane Nüsslein-Volhard and Eric 
Wieschaus undertook systematic genome-wide screens 
for genes involved in Drosophila development, which 
led to the discovery of the components of the major sig- 
naling pathways, many of which then turned out, like 
Ultrabithorax, Polycomb and Trithorax, to have homo- 
logs in other animals, including mammals. These new 
genes were then isolated and characterized by positional 
cloning and transposon tagging, '? an approach extended 
to other species including Arabidopsis and mice. 

Similar large insert libraries and physical maps of 
chromosomes were developed for many other organ- 
isms and used extensively for the mapping of mutations, 
especially those causing human genetic disorders, using 
restriction site polymorphisms as guideposts to track 
alleles in affected families (Chapter 11). They were also 
used as the platforms for whole genome sequencing proj- 
ects prior to the introduction of highly parallel random 
sequencing and assembly approaches. 

In the 1980s, Martin Evans, Oliver Smithies and Mario 
Capecchi developed methods for constructing transgenic 
mice using retroviral vectors and homologous recombi- 
nation in embryonal stem cells, ^-^ which permitted the 
introduction of specific mutations to examine their con- 
sequences and the rescue of mutant phenotypes by gene 
transfer.'é?! These approaches were further extended by 
‘enhancer traps’ to screen for genes based on their pattern
of expression”? and systems for ectopic expression.? 




WHOLE GENOME SEQUENCING 
OF BACTERIA AND ARCHAEA 


While some viral and organelle genomes had been 
sequenced,” the sequencing of organismal genomes was 
made feasible by the development in the mid-1980s of 
fluorescently labeled oligonucleotide sequencing prim- 
ers by Leroy Hood and colleagues, which enabled optical 
(laser) reading of electrophoretic displays of fragments 
generated by the Sanger chain termination method and 
the consequent development of highly parallel automated 
DNA sequencers” (Chapter 6). 

Using this technology, the first whole genome from 
an organism to be sequenced was that of the bacterium 
Haemophilus influenzae by Craig Venter, Hamilton Smith
and colleagues in 1995,” who devised a strategy of 'shot- 
gun' cloning and sequencing to avoid the tedious work 
of mapping a genome and the difficulty of coordinating 
laboratories working on different parts of it. Venter and 
Smith reasoned that it was easier, at least for small
genomes, to sequence random fragments en masse, and
then assemble them in silico into continuous sequences,
called ‘contigs’, by matching overlaps, a logical extension
of their shotgun sequencing of human cDNAs.”8
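
The idea of assembling random fragments by matching overlaps can be sketched as a toy greedy merger. This is an illustrative assumption, not the assembly software used for H. influenzae: the function names, the minimum overlap length and the reads are hypothetical, and the sketch ignores sequencing errors, reverse-complement reads and repeats.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no such overlap of at least min_len exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)

# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
toy_reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
toy_contigs = greedy_assemble(toy_reads)
```

Greedy merging can join the wrong fragments when the same sequence occurs at more than one location, which is why repetitive genomes are much harder to assemble this way.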

And so it proved, and the sequence of the 1.83 Mb H.
influenzae genome was completed 2 years before that of
E. coli (which has a larger genome, 4.6 Mb”), whose
sequencing was begun earlier. This was the first time that
molecular biologists were able to read the entire genome
sequence of a living cell, identify all of the protein-cod- 
ing genes (Figure 10.1) and start to understand its genetic 
programming and evolutionary history holistically, 
exemplified by the insights gained from sequencing the 
intracellular parasite Rickettsia prowazekii, the causative 
agent of epidemic typhus.?? 

These studies revealed many previously unknown 
features of bacterial genomes, including remnants of 
bacteriophages — which proved the signpost to a spectacu- 
lar new technology for genetic engineering (Chapter 12) — 
and other sequences that suggested genome evolution 
and plasticity through transposition and horizontal DNA 
exchange, often using bacteriophages as the vehicle. 

Thousands of bacterial and archaeal genomes have
since been sequenced and deposited in the public data-
bases. The data reveal that prokaryotic genomes, while
encoding some short regulatory RNAs (Chapter 9), are
composed, in the main, of protein-coding genes, sepa-
rated by short regions that contain cis-acting transcrip-
tional and translational control sequences.‘ Nonetheless,
their genomes collectively encode extraordinary pro-
teomic diversity and fluidity of gene content, reflecting
their range of ecologies from commensal pathogens to
deep ocean volcanic vents and industrial waste.???6

^ The human mitochondrial genome is only ~17 kb (sequenced in
1981 by Sanger and colleagues”); the tobacco chloroplast genome is
~156 kb, and its sequencing by two Japanese research teams? in 1986
was a tour-de-force at the time.

This approach is easier in bacteria because of the low frequency of
repetitive sequences whose locations are ambiguous.

For example, most E. coli strains contain between 
4,000 and 5,000 genes,* but only 20% of the genes in a 
typical E. coli genome are shared among all strains,?” 
which is the core proteome that defines the species, 
whereas the total number of different protein-coding 
genes observed in different strains exceeds 16,000.3438 A 
recent analysis of 303 million bacterial genes from 13,174 
publicly available metagenomes showed that most genes 
are specific to a single habitat and that the majority of 
species-level genes and protein families are rare.’ That 
is, phenotypic diversity in prokaryotes, primarily meta- 
bolic and ecological versatility, is achieved by varying 
the proteome. 
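
The distinction between genes shared by all strains and the total repertoire seen across strains amounts to a set intersection versus a set union. The sketch below is a hedged illustration only: the strain names and gene labels are hypothetical, and the term "pan" for the union is introduced here and does not appear in the text above.

```python
def core_and_pan(strains: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (core, pan): genes present in every strain, and genes
    present in at least one strain."""
    gene_sets = list(strains.values())
    if not gene_sets:
        raise ValueError("at least one strain is required")
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan

# Hypothetical gene content for three strains of one species.
toy_strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g6", "g7"},
}
core, pan = core_and_pan(toy_strains)
core_share_of_A = len(core) / len(toy_strains["strain_A"])
```

In this toy case half of strain_A's genes are core, whereas the figure quoted above for E. coli is only about 20%.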


GENOME SEQUENCING OF 
UNICELLULAR EUKARYOTES 


The first eukaryotic genome to be sequenced was the 12.4 
Mb genome! of S. cerevisiae, achieved chromosome- 
by-chromosome by an international consortium led by 
André Goffeau in 1996, which identified almost 6,000 
protein-coding genes and 140 rRNAs, 40 snRNAs and 
275 tRNAs ^? 

The S. cerevisiae genome contains only a few (~270)
short introns (average 247 bp), located in just 4.5% of
protein-coding genes, ? deletion of which was later shown
to have physiological and growth effects,*^-^" i.e., even
these small introns contain information. Interestingly,
the number of introns in S. cerevisiae is substantially
lower than in the superficially similar S. pombe, whose
13.8 Mb genome has fewer protein-coding genes (~4,900)
but many more introns (~4,700, in 40% of protein-coding
genes), albeit still mainly small (30-800 bp, modal length
48 bp).*

d Prokaryotic genomes range in size from just 160 kb in the insect sym-
biotic bacteria Carsonella ruddii*! and Nasuia deltocephalinicola?
(and are generally small in endosymbiotic and obligate intracellular
parasitic species, such as Mycoplasma) to 14.8 Mb in the free-living
soil bacterium Sorangium cellulosum,? with ~11,500 protein-coding
genes, which appears to be close to the upper limit? (Chapter 15).

e The standard laboratory strain of E. coli (‘K-12’) has ~4,300 protein-
coding genes.”

f At the time of writing, the smallest known eukaryotic genome is
that of the microsporidian parasite Encephalitozoon intestinalis (2.3
Mb). The largest known plant genome is that of the monocot Paris
japonica (~150 Gb). The largest known animal genome is that of the
marbled lungfish, Protopterus aethiopicus (~130 Gb)
(Chapter 7).%

FIGURE 10.1 Circular map of the H. influenzae genome illustrating the location of key restriction sites (outer perimeter), color-
coded predicted coding regions (outer concentric circle), regions of high and low G+C content (inner concentric circle), coverage
of clones used to generate the sequence (third concentric circle), locations of the six ribosomal operons (green), tRNAs (black) and
the cryptic mu-like prophage (blue) (fourth concentric circle), simple tandem repeats (fifth concentric circle), the putative origin
of replication at the outward pointing green arrows and putative termination signals (red). (Reproduced from Fleischmann et al.”
with permission of the American Association for the Advancement of Science.)

Both S. cerevisiae and S. pombe have proven invaluable in
the genetic dissection of the control and mechanics of cell
division and other basic processes such as protein
trafficking, and in the identification of homologous genes
in plants and animals, including humans. Indeed, genome
analyses showed that the protein components of core
cellular processes have been conserved throughout
eukaryotic evolution.

Large numbers of yeasts (including such pathogens as
Candida albicans, which causes thrush) have now been
sequenced, showing, for example, that many brewing yeasts
are hybrids of S. cerevisiae and S. eubayanus, and that
both originated in East Asia, with different strains being
domesticated in different places.

The genome sequence of N. crassa was published in 2003 and
found to contain ~10,000 protein-coding genes, with
introns and intergenic regions occupying ~56%. About 10% of
the genome consists of repetitive sequences.

And, of course, the genomes of important parasites, such as
Plasmodium falciparum, which causes malaria, were also soon
sequenced, along with that of its mosquito host in 2002.
The richness of the information in these genome sequences
is extraordinary, and is still being investigated. To do so
has required the construction and maintenance of databases
and the acquisition of bioinformatic tools and skills, a
major change to the investigative landscape of all
biological domains, from evolution to neurobiology.


GENOME SEQUENCING OF MODEL 
PLANTS AND ANIMALS 


The first (nearly) complete sequence of the genome of any
multicellular organism was that of C. elegans, accomplished
in 1998 by a consortium led by John Sulston, who (with
Sydney Brenner and H. Robert Horvitz) pioneered it as an
experimental organism. C. elegans has only ~1,000 somatic
cells, whose ontogeny had been determined, and has been a
useful model for many processes including cell
differentiation (Chapter 15), RNA interference (Chapter
12), transgenerational inheritance (Chapter 17), drug
responses such as nicotine withdrawal, and aging, among
others. The C. elegans genome was found to be 97 Mb in size
and to contain just over 20,000 protein-coding genes,
“one-fifth to one-third the number predicted for humans”.
Seventy-three percent of the C. elegans genome is comprised
of introns (26%) and ‘intergenic’ (47%) sequences (Figure
10.2).

Two years later, the sequence of the Drosophila
melanogaster genome was completed by a consortium led by
Venter and Jerry Rubin. This time the approach was
different: not sequencing of previously mapped cloned
segments, as was the case with C. elegans, but mainly
sequencing of random fragments, as had been done with H.
influenzae. This achievement put paid to the skepticism
that many had expressed about this approach because of the
problem of repetitive sequences in genome assembly; shotgun
sequencing is now the standard method.

The Drosophila genome is ~120 Mb in length and encodes only
~13,600 protein-coding genes, many of which have
equivalents in humans. This is only twice the number of
protein-coding genes in yeast and fewer than in C. elegans,
which is developmentally far simpler than an insect. This
anomaly was noted in a commentary at the time: “... there
is little relationship between total gene number, neuron
number, morphology and behavioral capacities of diverse
organisms in different phyla ... (which) merely highlight
our ignorance of biological complexity and how it is
instantiated.”64

By contrast, the Drosophila genome contains a greater
proportion than C. elegans of introns (≥41,000 ranging up
to 70 kb in length, for example, in the bithorax and DMDᵍ
genes) and ‘intergenic’ sequences, which collectively
comprise ~80% of the genome, one of the first hints from
genome sequencing that increased developmental complexity
is not a function of the number of protein-coding genes,
but rather of information in non-coding regions.

In the same year (2000), the first plant genome sequence
was also published, that of Arabidopsis thaliana, which has
one of the most compact plant genomes known (125 Mb),
similar in size to that of Drosophila, but contains almost
twice as many protein-coding genes, ~25,500. The genome
sequences of two rice cultivars (~450 Mb; 30,000-50,000
protein-coding genes) were published in 2002.6970

A “first draft” of the human genome sequence was published
in 2001,172 and a more complete compendium in 2004
(Chapter 11), revealing the full extent of the complement
of sequences derived from transposable elements, other
repeats, introns and ‘intergenic’ regions. The mouse genome
was published in 2002, followed soon by the sequences of
rat, dog, cow, chimpanzee, chicken, pufferfish, the
ascidian Ciona (a primitive chordate) and many others.

These studies revealed high conservation of the
protein-coding gene complement among vertebrates (~20,000
protein-coding genes, 75% orthologous between fish and
human), and especially mammals (~90% orthologous), with
lineage-specific expansion or contraction of some gene
families such as those encoding cytokines and olfactory
receptors.

The genome of the pufferfish (Takifugu rubripes, often
referred to simply as Fugu) was sequenced because it is
unusually compact (just 365 Mb, an order of magnitude
smaller than in human, but three times bigger than in
Arabidopsis), and was held up as a model of a streamlined
vertebrate genome with minimal “junk”. The Fugu genome
contains 11% protein-coding, 22% intronic and 67%
intergenic non-coding DNA, 17% comprised of repetitive
sequences.


THE G-VALUE ENIGMA 


The unexpected finding from the genome projects was the
lack of correlation between the number of protein-coding
genes and developmental complexity. Up until this point,
gene number had been widely proffered to be a valid measure
of biological complexity, and may still be, if the
definition of a ‘gene’ is extended to those encoding
regulatory RNAs (Chapters 12, 13 and 16).


g Due to its exceptionally large introns, DMD is one of the largest
protein-coding genes not only in Drosophila but also in Fugu and
human, suggesting that both the exons and the introns of this gene
have been conserved.


FIGURE 10.2 C. elegans and map of its genome: Distributions of predicted genes (pale blue); EST matches (green); yeast
protein similarities (dark blue); and inverted (purple), tandem (red), and TTAGGC repeats (black) along each chromosome.
Numbers are Mb. (Reproduced from The C. elegans Sequencing Consortium with permission of the American Association for
the Advancement of Science. C. elegans image by K. D. Schroeder made available under Creative Commons Attribution-Share
Alike 3.0 Unported license.)

To recap, C. elegans, a simple nematode with only ~1,000
somatic cells, has ~20,000 protein-coding genes, as has its
sister species C. briggsae. Sponges, the most basal
metazoans, have ~30,000 protein-coding genes. The far more
complex insect Drosophila has ~13,600 protein-coding genes
and mosquitos have ~16,000, whereas the water flea Daphnia
has ~30,000, the increase in the latter apparently related
to ecological flexibility rather than developmental
complexity.

Humans have ~40 trillion cells sculpted into a myriad of
different muscles, bones and organs with complex
architectures, as well as a brain with approximately 85
billion neurons (Chapter 15), but just ~20,000
protein-coding genes, similar to C. elegans and other
mammals. Indeed, although it fluctuates, the number of
protein-coding genes remains remarkably static across the
animal kingdom, despite enormous differences in
developmental complexity and cognitive capacity.103,104
Moreover, the majority of protein-coding genes in animals
are orthologous, including most of those involved in
multicellular development and brain function. That is, all
animals have a similar protein toolkit.

On the other hand, in contrast to the lack of scaling of
protein-coding genes, the fraction of the genome that is
intronic and ‘intergenic’ increases with developmental
complexity, crudely defined as the number of different
‘cell types’ (Chapter 7), although this definition
underestimates the different spatial identities,
architectures and ontogenies of functionally similar (e.g.,
muscle or bone) cells (Chapter 15). Prokaryotes have
~10%-15% non-protein-coding sequences, mainly specifying
cis-regulatory elements controlling transcription and
translation. The non-coding fraction of the genomes of
unicellular eukaryotes (protists) generally lies in the
range of 40%-50%, fungi 50%-60%, plants 70%-90%, and
animals mostly in excess of 90%, with the human genome
having 98.8% non-coding DNA (Figure 10.3).103,104
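
The contrast between roughly constant protein-coding content and expanding non-coding content can be checked with simple arithmetic on figures quoted earlier in this chapter. The sketch below is illustrative only: it treats the quoted intronic plus ‘intergenic’ percentage as the non-protein-coding fraction, which is an approximation, and the function name `split_genome` is an assumption introduced here.

```python
# (genome size in Mb, non-protein-coding fraction) as quoted in this chapter:
# C. elegans 97 Mb with 73% introns + 'intergenic'; Drosophila ~120 Mb with
# ~80%; Fugu 365 Mb with 22% intronic + 67% intergenic = 89%.
GENOMES = {
    "C. elegans": (97, 0.73),
    "Drosophila": (120, 0.80),
    "Fugu": (365, 0.89),
}


def split_genome(size_mb, noncoding_fraction):
    """Split a genome size (Mb) into (coding_mb, noncoding_mb)."""
    noncoding_mb = size_mb * noncoding_fraction
    return size_mb - noncoding_mb, noncoding_mb


for name, (size_mb, fraction) in GENOMES.items():
    coding, noncoding = split_genome(size_mb, fraction)
    print(
        f"{name:<11} {size_mb:>4} Mb  "
        f"coding ~{coding:5.1f} Mb  non-coding ~{noncoding:6.1f} Mb"
    )
```

On these rough numbers the protein-coding portion stays within a narrow band of a few tens of megabases, while the non-coding portion grows several-fold between the nematode and the fish.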

Clearly the majority of the information that orchestrates
developmental programs and phenotypic diversity lies in the
non-protein-coding regions of the genome, which raises the
questions of what form the information takes and how it is
transduced. The conventional view has been that it involves
the combinatorics of cis-regulatory protein-binding sites,
more complex post-translational modifications, and
expansion of the range of protein isoforms by alternative
splicing,ʰ all of which requires additional regulatory
information (Chapter 15). However, cis-regulatory elements
cannot conceivably occupy more than a small fraction of
gigabase-sized vertebrate genomes (recently estimated to be
~7%). On the other hand, the high-throughput RNA sequencing
that followed on the heels of genome sequencing revealed
that the non-coding regions of animal and plant genomes
express thousands of regulatory RNAs in different cells and
tissues at different developmental stages (Chapters 12 and
13).


COMPARATIVE GENOMICS AT 
NUCLEOTIDE RESOLUTION 


Since that time, there has been a myriad of comparative
analyses of genomes. An early example was the comparison of
12 genomes in Drosophila phylogeny, which identified, among
other things, many putatively non-neutral changes in
protein-coding genes, non-coding RNA genes and
cis-regulatory regions, high conservation of shared
microRNA sequences, including target mismatches, and
adaptive evolution of lineage-restricted microRNAs (Chapter
12).

Another unexpected finding of the comparison of vertebrate
genomes was the discovery of several hundred
“ultraconserved” elements >200 bp (UCEs) that are identical
between the human, mouse and rat genomes, none of which are
protein-coding. A subsequent analysis requiring identity of
sequences >100 bp between any three of five mammalian
genomes (human, rat, mouse, dog and cattle) identified
almost 14,000 such UCEs, the vast majority of which are not
protein-coding, and showed that they evolved rapidly,
presumably under positive selection, between fish and
amniotes, but then became essentially frozen, subject to
fierce negative selection in birds and mammals.11912
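
The operational definition used above, stretches of a given minimum length that are identical in every genome compared, amounts to scanning an alignment for long runs of identical columns. The following is a minimal sketch of that idea and not the method of the cited studies. It assumes the sequences have already been aligned to equal length with `-` marking gaps, it treats the threshold as "at least `min_len` columns", and the function name `identical_runs` is hypothetical.

```python
def identical_runs(aligned, min_len=200):
    """Yield (start, end) half-open intervals where every aligned sequence
    carries the same non-gap base for at least min_len consecutive columns."""
    if not aligned:
        return
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    run_start = None
    # Iterate one position past the end so a run reaching the end is closed.
    for i in range(length + 1):
        # A column is conserved if no sequence has a gap and all bases agree.
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                yield (run_start, i)
            run_start = None


# Toy alignment (made-up sequences) with a short threshold for illustration.
human = "ACGTACGTTTGACCA"
mouse = "ACGAACGTTTGACGA"
rat = "ACGTACGTTTGACTA"
print(list(identical_runs([human, mouse, rat], min_len=5)))  # prints [(4, 13)]
```

Real analyses work on whole-genome alignments and apply further filtering, but the core test is the same: perfect identity, column by column, over an unusually long stretch.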


h The extent to which alternative splicing expands the proteome may
be limited. While it is clear that alternative splicing can increase
greatly the number of isoforms of particular proteins (such as in
the classic example of the Drosophila Dscam (Down syndrome cell
adhesion molecule) gene, which expresses over 38 thousand distinct
mRNAs), much of the alternative splicing in mRNAs, especially
in humans, occurs among 5' non-coding regulatory exons, not within
the body of the protein-coding exons, and a large proportion
generates non-coding transcripts (Chapter 13). Most protein-coding
genes express a single dominant splice isoform.

The sequences of 101 Drosophilid genomes have recently been
published.


FIGURE 10.3 The relationship between biological complexity and genome composition. The y-axis shows the amount of
protein-coding sequence (red) and non-protein-coding sequence (blue), which together comprise the total genome size (x-axis),
in 76 organisms across the phylogenetic spectrum encompassing 23 species of bacteria, 7 protozoa, 9 simple and complex fungi,
14 plants (including Chlamydomonas, the green alga Volvox carteri, Arabidopsis, rice, maize and grape), 9 invertebrates (including
sponge, C. elegans, Drosophila melanogaster and the ascidian Ciona intestinalis) and 14 vertebrates (including the pufferfish
Takifugu rubripes, zebrafish, frog, the lizard Anolis carolinensis, chicken, mouse, cow, dog, and human). Both axes are
logarithmic (log10). The number of different cell types in each organism is taken as an indication of developmental complexity.
In unicellular organisms, protein-coding sequences dominate, but the proportion of non-coding sequences increases relative to
protein-coding sequences, the two intersecting in simple multicellular organisms, after which the protein-coding sequences remain
relatively constant, whereas the extent of non-protein-coding sequences increases exponentially. (Reproduced from Liu et al.)


Each UCE is different but has followed the same
evolutionary trajectory and so presumably there is some
commonality of function. At least some are derived from
retrotransposons. They are far more conserved than
sequences specifying proteins and rRNAs, which are highly
constrained by structure and multilateral RNA-RNA and
RNA-protein interactions. Many are enriched in the vicinity
of developmental genes and appear to overlap developmental
‘enhancers’ (Chapters 14 and 16), especially in the brain,
and many are transcribed into non-protein-coding RNAs, with
highly specific expression patterns that are perturbed in
cancers and other diseases. UCEs are also
dosage-sensitive.132,133

However, in contrast to their extraordinary pan-amniote
conservation, deletion of four UCEs that function as
enhancers in transgenic assays showed no overt
developmental perturbation, and insertion of sequences into
UCEs made no change to enhancer activity, although
cognitive phenotypes were not examined. Subsequent deletion
of UCEs in the vicinity of the neuronal transcription
factor Arx also resulted in viable and fertile mice, but
these showed subtle neurological or growth abnormalities.
Recent results show that the ultraconservation of enhancers
is not necessary for their function, and the reason for the
fierce conservation of UCEs in birds and mammals remains a
mystery.

Reciprocally, comparative analyses also identified many RNA
genes that have been subject to positive selection in
hominid evolution. One of the most rapidly evolving
sequences in the human genome lies within a gene (HAR1F)
specifying a highly structured non-protein-coding RNA
expressed in Cajal-Retzius neurons during embryonic
development of the neocortex, a six-layered structure that
is far larger and more complex in humans than in other
mammals, including Old World monkeys. Only two nucleotide
changes have occurred in the 118 bp HAR1 sequence between
chickens and chimpanzees, but there have been 18 changes in
the human sequence since our split from the latter. A
number of such “human accelerated regions” regulate
dosage-sensitive neural genes, acting as enhancers and/or
expressing regulatory RNAs, mutations in which disrupt
cognition and social behavior. Moreover, many
primate-specific RNAs, including ‘repeat-derived’ long
non-coding RNAs, are involved in a variety of
developmental, physiological and cognitive processes (see
below and Chapter 13).
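
The scale of the acceleration in HAR1 can be put in rough numerical terms. The counts of changes and the sequence length below come from the text; the divergence times do not. They are illustrative round numbers assumed here (about 310 million years for the chicken and chimpanzee lineages, about 6 million years for human and chimpanzee), so the resulting ratio is only an order-of-magnitude sketch, and the function name `changes_per_site_per_my` is hypothetical.

```python
HAR1_LENGTH_BP = 118          # from the text
CHANGES_CHICKEN_CHIMP = 2     # from the text
CHANGES_HUMAN_LINEAGE = 18    # from the text

# ASSUMED divergence times in millions of years (My); illustrative round
# numbers that are not taken from the text.
CHICKEN_CHIMP_SPLIT_MY = 310
HUMAN_CHIMP_SPLIT_MY = 6


def changes_per_site_per_my(changes, length_bp, elapsed_my):
    """Average number of changes per site per million years."""
    return changes / length_bp / elapsed_my


# Chicken and chimpanzee are separated by two branches back to their common
# ancestor, so the elapsed evolutionary time is twice the split time.
background = changes_per_site_per_my(
    CHANGES_CHICKEN_CHIMP, HAR1_LENGTH_BP, 2 * CHICKEN_CHIMP_SPLIT_MY
)
# The 18 changes are on the human branch alone since the split.
human = changes_per_site_per_my(
    CHANGES_HUMAN_LINEAGE, HAR1_LENGTH_BP, HUMAN_CHIMP_SPLIT_MY
)

print(f"sites changed, chicken vs chimpanzee: {CHANGES_CHICKEN_CHIMP / HAR1_LENGTH_BP:.1%}")
print(f"sites changed, human lineage: {CHANGES_HUMAN_LINEAGE / HAR1_LENGTH_BP:.1%}")
print(f"rate ratio under the assumed dates: ~{human / background:.0f}x")
```

Whatever dates are chosen, the point is the same: a sequence that barely changed over hundreds of millions of years changed at many sites in a few million years on the human branch.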


PSEUDOGENES AND RETROGENES 


Large numbers of ‘pseudogenes’, rivaling the number of
protein-coding genes, were also identified in genomic data,
almost 20,000 in the human genome. Pseudogenes are
fragments of duplicated protein-coding genes and
‘processed’ intronless copies of mRNAs that (presumably)
have been reverse transcribed and retroposed into the
genome (‘retrogenes’), which have been interpreted as
non-functional ‘molecular fossils’, because they contain
incomplete open reading frames or disabling mutations
(Chapter 7). Curiously, retrogenes are found mainly in
mammals. At least some have been subject to evolutionary
selection. Many are transcribed in specific cells, and
several have been shown to regulate the expression of their
protein-coding counterparts, with medical implications
(Chapter 13).


RNA, the Epicenter of Genetic Information


TRANSPOSABLE ELEMENTS 


The genome sequencing projects also revealed the
repertoire, distribution, age, activity and features of
sequences derived from transposons and retroviruses,
collectively referred to as TEs (transposable elements),
the dominant components of most plant and animal genomes.
TEs comprise only a small fraction of yeast, slime mold and
Drosophila genomes, can be almost absent or occupy a large
fraction of protozoan parasite genomes, and are highly
variable both in extent and type in vertebrate and plant
genomes (Figure 10.4).

In agreement with Britten's early estimates, nearly half,
and perhaps as much as two-thirds, of the human genome is
derived from DNA transposons and from short and long
interspersed retrotransposable elements (‘SINEs’ and
‘LINEs’) and endogenous retroviruses (‘ERVs’) that
replicate and invade genomic sites via RNA intermediates,
although most are now quiescent.7174179-184 A quarter of
these TEs correspond to ~1.2 million highly similar, but
not identical, copies of Alu SINE elements (derived from
7SL RNA, Chapter 8) that entered the human lineage in three
waves during primate evolution, and were the substrate for
a massive expansion of RNA editing, especially in the brain
(Chapter 17), with evidence that at least some have been
exapted as cell type-specific enhancers (Chapters 14 and
15).


FIGURE 10.4 Distribution of TEs across eukaryote phylogeny. Reference genome size (sea green circles) varies dramatically
across eukaryotes and is loosely correlated with TE content. Abbreviations: LINE, long interspersed nuclear element; LTR,
long terminal repeat; SINE, short interspersed nuclear element; DNA, class II transposons. (Figure reproduced from Wells and
Feschotte with permission from Annual Reviews.)

Similar numbers and distributions of SINEs (some also
descended from 7SL RNA) occur in the mouse genome, although
they are distinct from Alu elements and entered the rodent
lineage independently. In both species SINEs are clustered
in gene-rich regions, especially near promoters, while
LINEs (17% of the genome) are concentrated in “gene-poor”
regions and depleted from promoters, indicating different
roles (examples in Chapter 16). There are hundreds of
thousands of LINE elements in mammalian genomes, but much
lower numbers in most non-mammalian vertebrates, although
there are exceptions (see below).

The Consortium human genome paper concluded “the
organization of Alu elements ... suggests that there may be
strong selection in favour of preferential retention of Alu
elements in GC-rich regions and that these ‘selfish’
elements may benefit their human hosts”, a conclusion
confirmed by a later study that showed “that Alu and B1
elements have been selectively retained in the upstream and
intronic regions of genes belonging to specific functional
classes ... (with) no evidence for selective loss of these
elements in any functional class”.194

Indeed, while sequences derived from and distributed by
transposable elements have been thought to be largely
non-functional (Chapter 7), there is a wealth of evidence,
dating back to McClintock's studies showing transposition
altering phenotype in maize and the regulated expression of
TEs in development observed by Britten, Davidson and others
(Chapter 5), as well as logic, that TEs are major sources
of genetic innovation. They contain and mobilize modular
cassettes of (mainly) regulatory information to influence
phenotype over evolutionary history and in real
time.209217221 They are perhaps the most important
mediators of genetic fluidity, often called ‘jumping genes’
although most do not fit the traditional concept of a
‘gene’, as McClintock intuited by referring to them as
‘controlling elements’ (Chapters 2 and 5). The electronic
analogy is control packets.


i Retroviruses and TEs are thought to share an evolutionary
relationship. Similar to retroviruses, ERVs and LINEs encode a
reverse transcriptase and mobilize via an RNA intermediate.


TRANSPOSABLE ELEMENTS AS 
FUNCTIONAL MODULES 


Thousands of human TEs appear to have undergone positive
selection in the vicinity of developmental genes. Other
genomic regions, mainly non-coding but also associated with
developmental regulation, have been refractory to
transposon insertions. A substantial fraction of regulatory
sequences in humans, including 25% of promoters and many
developmental enhancers, contains sequences derived from
TEs. Some 30%-40% of mouse and human RNA transcripts
initiate within repetitive elements,2522 and analysis of
approximately 250,000 retrotransposon-derived transcription
start sites showed that the derived transcripts are
generally tissue-specific, coincide with gene-dense regions
and often function as alternative promoters and/or express
non-coding RNAs (Chapter 13).22 Some ancient TEs in the
vertebrate lineage contain subsequences that have been
retained over huge evolutionary distances.121,227-230

TEs have been shown to be the source of protein-coding and
non-coding genes or exons, centromeres, transcription
factors, their binding sites and networks, lineage-specific
regulatory RNAs and tissue-specific developmental enhancers
(Chapters 14 and 16),121,153,155,200,224,237,243-246
promoters and transcription start sites, epigenetic control
modules, neocentromeres, targets for parental imprinting,
splice sites, translational controls, microRNAs and
microRNA targets, RNA nuclear localization signals and
behavioral modifiers.265

TEs are the building blocks for epigenetic regulation and
chromatin organization, the senior level of the control of
gene expression and cell fate decisions during development
in complex organisms (Chapter 14). Many ‘repeats’ are
involved in the formation of heterochromatin, the
importance of which was historically downplayed, albeit
with exceptions (Chapter 7), but which is now known to be
regulated, inter alia, by KRAB zinc finger proteins that
bind to TEs and other transcription factors that have
evolved to regulate TE-derived regulatory sequences during
embryogenesis and neuronal differentiation (Chapters 14 and
17).

TEs are also a common source of functional domains in
regulatory RNAs, for example, as modules for
protein-binding and interaction partners for enhancer
action (Chapter 16). They are also prevalent in mRNAs of
rapidly evolving mammalian-specific genes.
Retrotransposon-derived sequences are widely incorporated
into coding and non-coding transcripts in human pluripotent
stem cells. Primate-specific retroviral ‘enhancers’
(Chapter 14) and associated TE-containing non-coding RNAs
are required for maintenance of stem cell identity and the
pluripotency network in humans,53242727 and the majority of
primate-specific regulatory sequences are derived from
transposable elements.27 They also occur in the most
abundant transcripts in the mouse oocyte and regulate gene
expression during early embryogenesis. Numerous
retrotransposons act as preimplantation-specific gene
regulatory elements and a mouse-specific retrotransposon is
essential for mouse preimplantation development.
Developmental transitions and cellular stresses increase
the expression of both human and mouse SINE transcripts,
suggesting a role in both development and physiology. LINE1
elements are spliced into non-canonical transcript variants
to regulate T cell quiescence and exhaustion.222 A
retrotransposon is also required for small-RNA-induced
pathogen avoidance memory in C. elegans and horizontal
transfer of that memory to naive animals.


TRANSPOSABLE ELEMENTS AS DRIVERS 
OF PHENOTYPIC INNOVATION 


TEs underpin many aspects of quantitative trait variation,
due to their capacity to alter gene expression patterns in
differentiation and development, and thereby to act as
drivers of adaptive/regulatory evolution. Bursts of
retrotransposition have been linked with major
diversification and speciation events. Transposon
insertions have been associated with developmental
innovations and transitions in vertebrates, including
tetrapod evolution, tail loss in the apes, human-specific
hippocampal development, the derivation of small breeds of
dogs from gray wolves and the differences between Poodles,
Boxers and Great Danes.300,301 The ‘calico’ white coat
color with spotting in cats arose through a retroviral
insertion in an intron that regulates the spatial
expression of the c-kit gene, which in turn controls
melanocyte differentiation. Similarly, a transposon-derived
inverted repeat in an intron of a gene, ‘goldentouch’, is
commonly associated with color polymorphism in Midas
cichlid fishes.303

The Rag recombinase proteins involved in V(D)J
recombination in the adaptive immune system of vertebrates,
and the signal sequences on which they act, are also
derived from transposons, as are the regulatory networks
underlying MHC (major histocompatibility complex)
expression. The regulation of innate immunity has also
evolved through the co-option of endogenous retroviruses.

The classic textbook example of adaptive microevolution,
the emergence of a black form of the British peppered moth
during the Industrial Revolution, which was widely cited
during the development of mathematical evolutionary theory
and the Modern Synthesis (Chapter 2),30830 proved to be due
to an intronic TE insertion that increases the expression
of the gene cortex, a member of a conserved family of cell
cycle regulators that controls pigmentation pattern. The
insertion is estimated to have occurred in or around 1819,
when Charles Darwin was 10 years old (Figure 10.5).

Transposon insertions also underlie morphological
variations of tomatoes and the changes in the branching
structureˡ that marked the domestication of maize from its
wild teosinte ancestor, as well as subsequent flowering
time adaptations that allowed cultivars to be grown at
higher latitudes. Transposable elements change the color of
grapes and apples by insertions in the promoters of genes
encoding transcriptional activators of pigment production.
Analogous insertions occurred independently in Sicilian and
Chinese strains of ‘blood’ oranges, where the cold
dependency of the pigmentation reflects the induction of
the retroelement by stress. TE mobilization also appears to
be a major generator of genetic variation in Arabidopsis.

The huge genome sizes of many native and cultivated plants,
including wheat, maize, apples and onions that have been
selected in recent history, may reflect their greater
flexibility to use TE insertions and polyploidy to generate
phenotypic plasticity,2155317320-33 in the face of being
rooted to the spot, unlike animals, which can move and
consequently have more precise developmental requirements
and limited re-wiring options. This may also explain the
extraordinary diversity of phenotypes among closely related
plants, such as the varieties produced by artificial
selection of the wild mustard plant Brassica oleracea,
including broccoli, broccolini, Brussels sprouts, white and
red cabbage, cauliflower, kale, and kohlrabi, all of which
are the same species. Some animal lineages too, including
salamanders and other chordates, may have life history (and
specific niche/environmental factors) affecting and being
affected by changes in TE content and genome size, with
potentially significant impact on adaptations and diversity
within the lineage.324-326

TE insertions are linked intimately to ‘epigenetic’ control
of gene activity, notably by methylation (Chapter 14). The
cycling of transposable elements between active and
inactive states in maize is determined by the methylation
state of the element, and genome sequencing has revealed
the spontaneous insertion of a methylation-insensitive
TE-derived ‘epiallele’ in an inbred strain of mouse.
Indeed, careful analysis of the features of TEs, which
comprise greater than 85% of the maize genome, their
insertion sites, expression and methylation profiles, etc.,
“reveal a diversity of survival strategies ... with each TE
family representing the evolution of a distinct ecological
niche ... (and whose impact) is highly family- and
context-dependent”.


k The dark form has long been thought to be positively selected
because it provided better camouflage from bird predation in a sooty
environment (Chapter 2).

l Altering the pattern of expression of action of a distal regulatory
‘enhancer’.


FIGURE 10.5 Adaptive evolution by transposon insertion into the first intron of the cortex gene of the British Peppered Moth
(Biston betularia; panel A) in the early Industrial Revolution, which increases the expression of the gene to create the sooty black
form (Biston betularia f. carbonaria; panel B), presumably improving camouflage and reducing bird predation. Panel C shows the
gene structure (insertion in yellow) and detail (panel D) of the class II DNA transposon containing three repeated units, flanked
by direct repeats resulting from target site duplication (black nucleotides) next to inverted repeats (red nucleotides). Moth
photographs A, B by Olaf Leillinger (Creative Commons Attribution-Share Alike 2.5 Generic license). (Gene structure C, D
reproduced from van't Hof et al. with permission from Springer Nature.)

As Nina Fedoroff said in her 2012 Presidential Address 
to the American Association for the Advancement of 
Science: 


I contend that it is precisely the elaboration of 
epigenetic mechanisms from their prokaryotic 
origins as suppressors of genetic exchanges that 
underlies both the genome expansion and the pro- 
liferation of TEs characteristic of higher eukary- 
otes. This is the inverse of the prevailing view that 


epigenetic mechanisms evolved to control the dis- 
ruptive potential of TEs. The evidence that TEs 
shape eukaryotic genomes is by now incontrovert- 
ible. My thesis, then, is that TEs and the trans- 
posases they encode underlie the evolvability of 
higher eukaryotes' massive, messy genomes. ..!! 


As the genomes of more and more species, includ- 
ing those representing key phylogenetic transitions, are 
sequenced, and awareness grows, the focus has changed 
from analyzing the repertoire of protein-coding genes 
to the nature and distribution of TEs. As a prime exam- 
ple, the publication reporting the marsupial opossum 
(Monodelphis domestica) genome sequence emphasized 
innovations in TEs and other non-coding sequences in 
the mammalian lineage, in sharp contrast with the stasis 
in protein-coding sequences, as “an important creative 
force in mammalian evolution"??? 

The 5 Gb genome of Tuatara, the only remaining mem- 
ber of an archaic order that last shared a common ancestor 
with other reptiles about 250 million years ago and is a 
link to the now-extinct stem reptiles from which dinosaurs, 
modern reptiles, birds and mammals evolved, is 64%
composed of an amalgam of TEs with both reptilian and 
mammalian features.?? In 2021, the complete sequence of 
the 43 Gb genome of lungfish, the closest living relative 
of the tetrapods, which is 14 times larger than the human 
genome, showed that it is 90% composed of intergenic and
intronic TEs, mainly LINE elements, that resemble those 
of tetrapods more than those of ray-finned fish.2%6,334 

Even ‘simple’ repeats (dinucleotide and trinucleotide 
“microsatellites”) or ‘short tandem repeats’, used as mark- 
ers in gene mapping and DNA fingerprinting, have 
been shown to play a role in adaptive radiation,’ be flex- 
ible?! and function in modulating gene expression.339340 
Simple repeats are also associated with quantitative trait 
variation,’ environmental adaptation?? and human neu- 
rodegenerative and neuropsychiatric conditions, 99994344 
likely with intergenerational consequences (Chapter 17). 
The naive idea that ‘repetitive’ sequences can be a priori
and collectively dismissed as junk (with a few ‘exceptions’)
is unsustainable in the face of these observations.

A blind spot in genome analysis and comparative 
genomics, particularly with short-read sequencing, is the 
difficulty of mapping repetitive sequences and segmental 
duplications. A related issue has been the widespread use of 
the ‘RepeatMasker’ program, which masks repeats and low
complexity DNA sequences, hiding over 50% of the human
genomic sequence.?^-4 These problems are being relieved
by the advent of long-read technologies (see below and Chapter
11), such as nanopore sequencing, which drags single molecules
of DNA (or RNA) through engineered protein pores
embedded in membranes and measures the disturbance in
the electrical current as nucleotides pass through, enabling
sequencing of much longer fragments (over 1 Mb) than
sequencing by synthesis (SBS; see below) and direct sequencing of RNA.322347355
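To make the scale of masking concrete, the following is a minimal sketch in Python. It assumes a repeat-masked sequence in which masked bases appear in lowercase (soft-masking) or as N (hard-masking); the function name `masked_fraction` and the toy sequence are illustrative assumptions, not part of any published pipeline.

```python
def masked_fraction(sequence: str) -> float:
    """Return the fraction of bases that are masked.

    Assumption: lowercase letters mark soft-masked bases and 'N'
    marks hard-masked bases; uppercase A/C/G/T are unmasked.
    """
    bases = [b for b in sequence if b.isalpha()]
    if not bases:
        raise ValueError("sequence contains no bases")
    masked = sum(1 for b in bases if b.islower() or b == "N")
    return masked / len(bases)


# Hypothetical toy sequence: 8 unmasked, 8 soft-masked and 4 hard-masked bases.
toy = "ACGTACGT" + "acacacac" + "NNNN"
print(f"{masked_fraction(toy):.0%} of the toy sequence is hidden from analysis")
```

On real assemblies N can also denote gaps rather than masked repeats, so a count of this kind would overstate masking unless gaps are excluded first.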


THE GREAT EXPLORATION — 
THE DIVERSITY OF LIFE 


The pace of genomic exploration was given a huge 
boost with new technologies allowing massive paral- 
lelization of the sequencing process. The most success- 
ful to date has been the ‘sequencing by synthesis’ (SBS) 
method, invented by Shankar Balasubramanian and David
Klenerman,?6 and later commercialized. SBS uses fluo- 
rescently labeled nucleotides containing reversible ter- 
minators to optically sequence high density clusters of 
PCR-amplified fragments on solid surfaces. SBS, along 
with other technologies, permitted a hyper-exponential 
increase in the volume of DNA sequence data produced 
and reciprocal reduction in cost — at a much faster rate 
than the so-called Moore's Law of computing (at one 
point a ~2-fold increase in capacity / processing speed 
and reciprocal halving of cost every 18 months) - the fast- 
est technology revolution in human history. 
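The comparison with Moore's Law can be made concrete with a little arithmetic. The sketch below, in Python, computes the fold-improvement implied by a fixed doubling time and, conversely, the halving time implied by a given fall in cost. The function names and the example cost reduction are hypothetical placeholders chosen for illustration, not data from this chapter.

```python
import math


def fold_change(years: float, doubling_months: float = 18.0) -> float:
    """Fold-improvement after `years` if capacity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def implied_halving_months(start_cost: float, end_cost: float, years: float) -> float:
    """Halving time (in months) implied by a cost falling from start to end."""
    if not (start_cost > end_cost > 0 and years > 0):
        raise ValueError("need start_cost > end_cost > 0 and years > 0")
    halvings = math.log2(start_cost / end_cost)
    return years * 12.0 / halvings


# An 18-month doubling time compounds to roughly a 100-fold gain per decade.
print(round(fold_change(10)))

# Hypothetical example: a 1,000,000-fold cost reduction over 15 years
# would imply a halving time well under 18 months.
print(round(implied_halving_months(1_000_000.0, 1.0, 15), 1))
```

Any trend that halves cost in fewer than 18 months outpaces the Moore's Law benchmark quoted above, which is the sense in which the text calls the fall in sequencing cost faster.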

Over the past decade, there has been an explosion 
of genome sequencing across the entire phylogenetic 
spectrum (Figure 10.6). Not only have the genomes of 
tens of thousands of bacterial and archaeal species been 
sequenced,*% but sequencing has become so sensitive 
and efficient that it became possible to sequence and 
deconvolute complex microbial communities (termed 
FIGURE 10.6 Timeline illustrating the major genome sequencing achievements from the mid-1960s to 2019, placed in a color-coded
background according to the sequencing approach. Orange, early sequencing methods; yellow: Sanger-based shotgun
sequencing; green: ‘next generation sequencing’ (NGS) technologies based on sequencing by synthesis;* blue, NGS plus long- 
read sequencing for whole genome assembly. (Reproduced from Giani et al. % with permission of Elsevier.) 
“metagenomics”), such as those in soil, sea- and freshwater,
mining and industrial sites, extreme environments,
and the digestive tracts of ruminants and humans. 
Indeed this is the only way to characterize the vast 
majority of prokaryotic life on earth, which cannot be 
cultured as (single) colonies on artificial media such as 
agar plates,* although this may be changing, not to 
mention the estimated billion viruses" in every cubic 
meter of the ocean.?66 

A subset of metagenomics is the human ‘microbiome’ — 
the bacteria, archaea, protists and fungi, and their own 
viruses, such as bacteriophages, that inhabit our gut 
and other places (skin, mouth, etc.), and which vastly
outnumber our own (human) cells — termed a human 
“supra-organism’’.>°7 

The human microbiome appears to have a large influ- 
ence on health, including metabolic activity, autoim- 
mune and inflammatory disorders, atherosclerosis and 
cancer, 0937 neurodegenerative and neurodevelopmen- 
tal disorders,*75-378 neurotransmitter biosynthesis,579%580 
social development,**!* depression,’ sensory and loco- 
motor behavior,384,385 stress responses,?*6 obesity??? and 
immunity,’ all of which are associated with particular 
types of gut bacteria and bacteriophages. The Human 
Microbiome Project was initiated in 2007.367.389 

A similar exploration of the hugely varied world of 
protists and fungi is also underway and is reshaping the 
eukaryotic tree of life.%% There are far too many projects 
to catalog here, except to say that it is now or soon will be 
unacceptable to study any species or ecosystem without 
sequencing the genomes involved. 

And so it is with plants and animals: for example, 
the 1,000 Plant Genomes Project initiated in 2008?! — 
over 200 angiosperm (flowering plant) genomes and 
over 1000 plant transcriptomes had been sequenced by 
2019;92393 the 10K Vertebrate Genomes Project initiated 
in 2009;3°49 the ‘Earth BioGenome Project’ initiated in 
2018 to characterize the genomes of all of Earth’s eukary- 
otic biodiversity;*% the genomes of hundreds of butterfly 
species;*%7%% and the ‘Zoonomia Project’ to characterize 
the genomes of eutherian mammals, with 131 assemblies 
reported in 2020.5% A comparative analysis of 363 bird 
genomes in 2020 more than doubled the fraction of bases 
that are predicted to be conserved between species and 
revealed extensive patterns of selection in non-coding 
DNA.400


? Viruses may be the universal genetic currency, trading information 
across species and kingdom boundaries. They have co-evolved with 
cellular life?9?39* and may have been instrumental in the formation 
of the eukaryotic nucleus.*% 
Despite the numerous (now mainly computational)
challenges, genome databases are moving beyond simple 
gene catalogs to encompass the diversity of variations 
(nucleotide substitutions, insertions and deletions, as 
well as structural changes, rearrangements and transpo- 
son insertions), and the presence or absence of particular 
genomic regions in individuals, populations and clades 
(the ‘pan-genome’ of a species), to allow a greater exploration
of genome dynamics and the basis of phenotypic
diversity.?5.521.401-403 

The examination and comparison of the evolution 
and divergence of genomes and their sequence elements404-407
is an enterprise that will continue for the
foreseeable future. Analysis of the genomes of extinct 
hominids such as Neanderthals and Denisovans by 
Svante Pääbo and colleagues and others is revealing
the details of recent human evolution, 4905-4? indicat- 
ing that there have been multiple bursts of adaptive 
changes specific to modern humans during the past 
600,000 years involving genomic regions related to 
brain development and function.*' Others are docu- 
menting the diversity in the human population*'* and 
the details of the migrations out of Africa,4^-^? the 
provenance of the biblical Dead Sea scrolls,?! and 
the genomes of extinct megafauna, such as the mam- 
moth?? and cave bear.?? 


FROM GENOME SEQUENCE 
TO GENOME BIOLOGY 


In the years following the completion of the pioneering 
projects, many studies were described as “genome-wide” 
or "global", even though they were limited to protein- 
coding genes or the ‘exonic’ component of the genome, 
again on the assumption that most of the relevant infor- 
mation resides therein. 

Fortunately, other studies extended to the whole 
genome, allowing the discovery of many dynamic 
features outside of coding sequences. The dramatic 
improvement in sequencing technologies enabled the 
implementation of unbiased methodologies to globally 
study the dynamic properties of genomes, including the 
progressive identification of all transcribed sequences 
(the ‘transcriptome’) and the positional modifications 
of histones and DNA (the ‘epigenome’), as well as 
protein-binding sites, chromatin structure and other 
features in different cell types at single cell resolution 
(Chapters 13-14). 
11 The Human Genome


THE PROJECT 


The flagship project of the age was, of course, the Human 
Genome Project (HGP). We devote a chapter to it, pri- 
marily in the light of its controversies and controversial 
findings, the interpretation of which bears heavily on the 
understanding of genetic programming and the use of 
genomic information in healthcare. 

The HGP was first mooted at a conference orga- 
nized at the University of California Santa Cruz in 
1985 by the biophysical molecular biologist Robert 
Sinsheimer,! attended by David Botstein, John Sulston, 
Bob Waterston, Leroy Hood, Walter Gilbert and George 
Church, among others. It was formally proposed in 1986 
by the cancer virologist Renato Dulbecco? at a meeting in 
Santa Fe attended by, among others, Sinsheimer, Watson 
and Charles DeLisi from the US Department of Energy 
(DOE)? The HGP was then recommended for fund- 
ing in 1987 by a subcommittee of the Office of Health 
and Environmental Research of the DOE (including 
Sinsheimer, Dulbecco and Hood), supported by many 
luminaries of the time, albeit with reservations.’ The first 
human genome sequencing conference was held in 1989 
at Wolf Trap Farm near Washington.^ 

The project captured the imagination and ambition 
of the US government — the biomedical equivalent of 
the Apollo Space Program — which provided most of 
the funding through the National Institutes of Health 
(NIH) and the DOE, with a large contribution from 
the Wellcome Trust in the UK and support from the 
governments of Japan, France, Germany and China — 
the ‘public’ project. There was also a parallel project 
undertaken by a private company, Celera Genomics 
Corporation, headed by the bête noire of the human
genetic establishment, Craig Venter. The public proj- 
ect was officially launched in 1990, but there was a lot 
of civil and not-so-civil toing-and-froing before it got 
seriously underway. 

There are three aspects worth recalling. The first is the 
debate about whether to sequence just mRNAs (cDNAs, 
as an extension of Venter's 1995 study) or to sequence 
the entire genome. Why spend all that money on sequencing
acres of junk?? Moreover, the view of “a surprisingly
vocal group" was that the project (in any case) was a 
waste of money that would be better allocated to other 
areas of research or to healthcare, exacerbated by a fear
of, or antagonism to, “big science"? 
For example: 


It is doubtful that much of the resulting information
will provide insights into human diseases or funda- 
mental biological processes ... (repeated sequences 
and introns) serve mainly to space exons or repre- 
sent junk DNA. Obtaining the sequence of these 
genomic regions is, in my view, simply a waste of 
money and effort ... Genome projects should be 
severely curtailed or, better still, abandoned.? 


And Brenner, astride the fence: "If something like 98%
of the genome is junk, then the best strategy would be to 
find the important 2%, and sequence it first”.!0 

By contrast from Sinsheimer: 


There is currently a facile assumption that only 
1 or 2 or 5 percent of the genome is ‘of interest.’ 
I am not convinced we know that. Surely, in an 
evolutionary sense, much more will be of interest. 
Knowledge of the variability among the genomes 
of individuals will surely shed light on variations 
in physiology and susceptibility to disease, as well 
as on questions of human origin.!! 


Others, Watson? in particular (who was made initial direc- 
tor of the project, and whose genome was the second to 
be sequenced!?), agreed and maintained that the human 
genome could not be understood unless it was sequenced 
in its entirety, including its non-coding elements, what- 
ever their extent and form might be." 

The second aspect was the speculation at the time 
about the numbers of ‘genes’ in the human genome, 
which had declined from early and seemingly ludicrous 
estimates of millions (based on genome size and bacte- 
rial-like gene density; Chapter 5) to somewhere in the 
range of 30,000-150,000.!*-12 As always the underlying 
assumption was that, apart from those specifying infra- 
structural RNAs involved in mRNA splicing and transla- 
tion, and a few others, the gene complement would be 
mainly protein-coding. 

The third was the effort to assemble a coordinated 
international consortium to undertake the project, mainly 


1 Watson's recollections may be found at https://wellcomecollection.org/works/m4kr8fz5.
at three meetings in Bermuda, co-chaired by the NIH,
DOE and the UK's Wellcome Trust, in 1996, 1997 and 
1998. They set out the so-called “Bermuda principles”, 
which held that the human genome sequence data 
should be made public immediately, promulgated by 
the Wellcome Trust (and its senior investigators, notably 
John Sulston), which had no external stakeholders to sat- 
isfy, and the NIH, which likely realized that US inter- 
ests would benefit most because of their capacity for fast 
adoption. This initially made life difficult for those from 
other countries, notably Germany, France and Japan, 
whose governments wanted to capture commercial value 
from their investment, but eventually the main players 
prevailed.20? 

The Bermuda conferences also discussed which 
group(s) would take responsibility for, and have prove- 
nance over, the sequencing of specific chromosomes, or 
parts thereof, based on their historical work on mapping 
of genetic disorders, and the resources they had devel- 
oped along the way — based on the clone and map then 
sequence strategy proposed by Gilbert (see 5). Venter 
announced that he would just sequence the whole lot, 
shotgun-style, and assemble the genome from overlap- 
ping ‘contigs’, which was met with a mixed reaction." 

The competition between the ‘public’ and ‘private’ 
genome sequencing initiatives had two interesting con- 
sequences: it spurred the funding agencies, notably the 
Wellcome Trust, to increase their investment in the 
project; reciprocally Celera used the flow of data releases 
from the public project to accelerate the development of 
its draft of the human genome sequence.” 

In any event, as is so often the case, competition was a 
good thing, and the project was completed ahead of time 
and under budget, with an estimated total cost around 
US$3 billion. Rapprochement was achieved and in
2001 ‘first drafts’ of the sequence (totaling ~2.9 giga- 
bases, or ~90%) of a composite genome amalgamated 
from a number of anonymous individuals by the public 
consortium and of Craig Venter’s genome by Celera were 
published contemporaneously in Nature? and Science,” 
respectively. These publications were accompanied by 
fanfare announcements on both sides of the Atlantic by 
then US President Clinton (flanked by Venter and Francis 
Collins, who coordinated the public project as the then 
director of the National Human Genome Research 
Institute) and UK Prime Minister Blair. A more complete 
sequence was published in 2004 (Figure 11.1).% 


> Personal recollection of JSM. 
* https://www.sanger.ac.uk/news_item/1998-05-13-wellcome-trust-announces-major-increase-in-human-genome-sequencing/.


RNA, the Epicenter of Genetic Information 


FIGURE 11.1 Industrial scale sequencing for the human
genome. (Reproduced from the Human Genome Sequencing
Consortium” with permission of Springer Nature.)


Analysis of the assembled sequences showed that 
just ~1% of the genome is protein-coding, with ~2% of 
the total represented in mRNAs (including the 5’ and 
3'UTRs that control mRNA localization, translation and 
turnover)? whereas 24% is intronic and 74% is ‘intergenic’
DNA. The genome was found to contain fewer
protein-coding genes than expected, the initial counts by
the two camps being 30,000-40,00074 and 26,588 with
“an additional approximately 12,000 computationally
derived genes with mouse matches or other weak supporting
evidence”.??

Even these surprisingly low estimates turned
out to be inflated, likely biased by prior expectations.
The actual number of human protein-coding genes has
since been revised downward to ~20,000,27-2 although
increasingly offset by growing numbers of genes found 
to express small and large non-protein-coding RNAs?? 
(Chapters 12 and 13). 


ASSESSMENT OF FUNCTIONALITY 


The subsequent publication and comparative analysis of 
the mouse genome sequence in 2002 (almost half of which 
can be aligned to the human genome, with 99% protein 
orthology*) included the estimate that only ~5% of the


* More recent estimates indicate that only 0.77% of the human genome
contains protein-coding information, and that exons in mature 
mRNAs occupy 1.74% of the genome.?’ 

* Most protein-coding genes are conserved in vertebrates, with a large 
proportion of proteins shared between human, birds and fish? 
many in all metazoans? (Chapter 10). 




sequences in mammalian genomes has been ‘conserved’ 
during evolution, and by imputation is functional. 

The estimate assumed that ancient ‘repeats’ (i.e., 
transposon-derived sequences) that have persisted in 
both genomes since their divergence over 100 million 
years ago are non-functional and can be used to deter- 
mine the rate and distribution of ‘neutral’ evolution of 
unconstrained sequences over time. Applying this esti- 
mate to the remainder of the alignable sequences showed 
that 95% had diverged to a similar extent, with only 5%
diverging more slowly, under evolutionary pressure for 
preservation of particular sequences, termed ‘purifying 
selection’, despite dramatic variations in the ‘neutral’ 
substitution rates across the genome (Figure 11.2).55-38 

The conclusion that most of the human genome is not 
under evolutionary selection and is therefore not functional 
was widely accepted. It supported the orthodox view and 
has remained a central plank of the argument that most of
the genome is junk, and is therefore important to address. 

There are several logical problems with the analysis 
upon which this conclusion relies. First, it is entirely cir- 
cular: the assumption that ancient transposon-derived 
FIGURE 11.2 Comparison of the distribution of sequence
divergence between the alignable fraction of the human and 
mouse genomes (dark blue), decomposed into a mixture of 
two scaled component distributions: neutrally evolving recog- 
nizable common ancient repeats (red) and sequences imputed 
to be under selection after subtraction of the red distribution 
from the blue distribution (light blue and gray), corresponding 
to approximately 5% of the total, which contains most of the 
orthologous protein-coding sequences (estimated to be about 
1.5%). The remainder is assumed to be conserved regulatory 
elements. Note that if the red curve comprises only the recog- 
nizable highly conserved end of the original distribution, the 
presumed neutral rate of sequence divergence will have been 
underestimated (the red distribution will be shifted left), and 
the proportion of the genome imputed to be under selection will 
be higher. (Reproduced from Waterston et al.% with permission 
of Springer Nature.) 
sequences that are orthologous in both genomes are non-functional
was used to justify the conclusion that most
of the rest of the genome is also non-functional. If the 
assumption is correct (although there was no evidence to 
support it), the conclusion is reasonable. If the assump- 
tion is wrong, then the conclusion is also wrong.^! 

Indeed, this was a questionable assumption given that 
the reference sequences have been retained indepen- 
dently in the mouse and human genomes for over 100 
million years, especially in view of the considerable evi- 
dence of the biological functions of TEs, the known cases 
of which, however, were regarded as exceptions rather 
than examples of a general phenomenon. This increas- 
ingly appears to be incorrect (Chapter 10). 

Second, even if the assumption that most ancient ret- 
rotransposon-derived sequences that date back to the com- 
mon ancestor are non-functional is correct, the analysis 
had an inherent flaw: many of the 'ancient repeats' used in 
the comparison are barely recognizable as being ortholo- 
gous, because their sequences have drifted apart, which 
means there may be, and likely are, an unknown number 
that have diverged further, to the point of being unrecog- 
nizable.?-^ Indeed the mouse genome analysis stated: 


The ability ... to detect (common ancestral) repeats 
was found to fall off rapidly for divergence levels 
above about 37%. If we simulate the events ... the
proportion of the genome that would still be recog- 
nizable as ancestral repeats falls to only 6%.?* 


Consequently, the rate of (supposed) neutral evolution in 
mammalian genomes, and therefore the extent of their 
functionality, was underestimated to an unknown extent, 
and even a small increase in the true neutral evolution 
rate results in a large increase in the proportion of the 
human genome that is under ‘purifying’ selection.” 
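The sensitivity described above can be illustrated with a deliberately simplified two-class model. This is an assumption made here for illustration and is not the mixture decomposition used in the mouse genome analysis: suppose a fraction c of sites is fully constrained and the remaining 1 - c diverge at the neutral rate, so that observed divergence = (1 - c) × neutral divergence. The function name and all numerical values below are hypothetical.

```python
def constrained_fraction(observed_divergence: float, neutral_divergence: float) -> float:
    """Fraction of sites under constraint in a two-class toy model.

    Assumes constrained sites never change and all other sites diverge
    at the neutral rate, so observed = (1 - c) * neutral and therefore
    c = 1 - observed / neutral.
    """
    if neutral_divergence <= 0:
        raise ValueError("neutral divergence must be positive")
    return 1.0 - observed_divergence / neutral_divergence


# Hypothetical numbers: the assumed neutral rate is chosen so that the
# baseline estimate is 5% of sites under constraint.
assumed_neutral = 0.40
observed = 0.95 * assumed_neutral

# Recompute the estimate if the true neutral rate was underestimated
# by 0%, 5%, 10% or 20%.
for underestimate in (0.0, 0.05, 0.10, 0.20):
    true_neutral = assumed_neutral * (1.0 + underestimate)
    c = constrained_fraction(observed, true_neutral)
    print(f"neutral rate +{underestimate:.0%}: constrained fraction = {c:.1%}")
```

In this toy model, a neutral rate only 10% higher than assumed raises the imputed constrained fraction from 5% to well over 10%, which is the leverage the argument relies on.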
Third, while sequence conservation imputes function — 
highly structured RNAs like rRNAs and proteins are 
constrained by their physicochemical structure-function 
relationships — lack of sequence conservation imputes 
nothing.** Not only do non-conserved, lineage-specific 
sequences underlie evolutionary novelties, regulatory 
sequences (including gene promoters and some enhancers) 
can and indeed do evolve quickly;9^75? like language, 
under different sequence-function constraints and posi- 
tive selection for adaptive radiation.*°*! A high proportion 


f The evolution of language is a useful comparison. The English word 
‘brother’ and the French word ‘frére’ have no obvious homology, but 
not only do both have meaning, they have the same meaning and 
are derived from a common antecedent, having diverged under loose 
sequence-function constraints (sender-receiver recognition, as in 
regulatory circuits).*° 
of non-coding RNAs and other regulatory sequences,
including many that have been functionally validated, 
show little sequence conservation, use TE-derived modu- 
lar elements and are lineage-restricted, sometimes with 
only short conserved sequence and structural ‘motifs’ 
embedded in large RNA molecules (Chapters 13 and 16). 

Subsequent studies showed that there are at least seven 
different rate classes of sequence evolution in the human 
genome,*' at least 18% of the human genome is conserved 
at the level of predicted RNA structure,? there is strong 
negative selection across both coding and non-coding 
sequences," and the vast majority of sequence variations 
influencing complex traits and diseases occurs in the 
non-coding regions of the genome (see below). 

A pairwise comparison of eight mammalian species 
concluded that “there is a high rate of turnover of func- 
tional non-coding elements in the mammalian genome, so 
measures of functional constraint based on human-mouse 
comparisons may seriously underestimate the true value”,^36
later reaffirmed by analyzing broader genomic datasets 
in avian lineages. This challenges the use of primary 
sequence conservation and a priori dismissal of repetitive 
sequences in the assessment of genome functionality. 


THE MAJORITY OF THE GENOME IS ACTIVE 


The possibility that the widely held belief that the vast 
majority of the human genome is non-functional may be 
wrong soon became evident in other ways. 
The large-scale transcriptome sequencing projects
that followed the genome projects revealed that most 
of the mammalian genome is differentially transcribed, 
producing an extraordinarily complex interlacing suite of 
coding and non-protein-coding RNAs, the latter exhibit- 
ing exquisitely precise expression patterns (Chapter 13). 

The subsequent ENCODE (Encyclopedia of DNA 
Elements") project, a large international study which aimed
to identify functional elements in the human genome, 
encompassing RNA expression, the distribution of chro- 
matin modifications (Chapter 14), transcription factor 
binding sites, DNase hypersensitive (exposed) regions, 
promoters, etc.? in different cell types (Figure 11.3), con- 
cluded, in its 2007 ‘pilot’ publication covering 1% of the 
genome, that most of the studied regions exhibited (these) 
biochemical indices of function.?>% 

This figure, however, was at odds with the estimate in 
the same paper (reiterating that from the earlier human- 
mouse genome comparison) that only “5% of the bases in 
the genome can be confidently identified as being under 
evolutionary constraint in mammals" ... and therefore 
that "Surprisingly, many functional elements are seem- 
ingly unconstrained across mammalian evolution"? 

After some internal (pre-submission) debate among the 
authors, a decision was made not to canvass the alternative
possibility that the estimate of the extent of ‘conservation’ 


* Unfortunately, the project did not include an examination of the inci- 
dence or distribution of alternative DNA structures. 
FIGURE 11.3 Representative compilation of ENCODE genomic features cataloged on part of human chromosome 22 in
the GM12878 lymphoblastoid cell line. Annotated (protein-coding) genes and their exon-intron structures are shown at top.
Chromosome segmentation refers to blocks sharing similar features. Other tracks show predicted enhancers (E, Chapter 14), transcription
start sites (TSS), RNA polymerase II binding sites (Pol II), open chromatin (DNase accessibility), nucleosome depleted
sequences (FAIRE?5) and the positions of nucleosomes marked with various histone modifications (Chapter 14). (Reproduced
from Dunham et al.% with permission of Springer Nature.)
of the genome (and that only clearly conserved sequences
are functional) might be incorrect.3%748 Instead, the incongruity
was rationalized as the existence of “a large pool of
neutral elements that are biochemically active but provide 
no specific benefit to the organism”. This somewhat con- 
tradictory statement became a key talking point and led to 
a wager publicized in Nature as to whether more or less 
than 20% of the human genome is functional, which at 
the time of writing had still not been settled. 

The more comprehensive 2012 genome-wide ENCODE 
paper" confirmed that at least 80% of the human genome 
"participates in at least one biochemical RNA- and/or 
chromatin-associated event in at least one cell type", and 
addressed the conservation conundrum by stating that 


an appreciable proportion of the unconstrained 
elements are lineage-specific elements required 
for organismal function ... and the remainder are 
probably ‘neutral’ elements that are not currently 
under selection but may still affect cellular or larger 
scale phenotypes without an effect on fitness. 


This paper spawned another round of controversy,°!- 
with some apoplectic at the suggestion that a large fraction 
of the genome may be functional, invoking the C-value 
paradox, mutational load and circular conservation argu- 
ments (Chapter 7), while rejecting any suggestion that 
dynamic transcription or differential chromatin modifica- 
tions in non-protein-coding regions might be valid indices 
of genetic function, including in a species- and clade- 
restricted fashion.?9959^ It is clear that some of the antag- 
onism was related to the invocation of junk in genomes 
as a line of argument against proponents of intelligent 
design,“ who seize and misuse scientific ideas and obser- 
vations to try to justify non-scientific, untestable beliefs. 


DAMAGED GENES 


Naturally, the human genome was a major focus of 
genetic mapping by medical geneticists to identify genes 
responsible for serious inherited ‘Mendelian’ metabolic,
physiological, developmental and/or cognitive disorders.i


h The data included identification of ~2.9 million DNase
hypersensitivity sites, ~580,000 of which could be connected to
promoters, and evidence that over 75% of the genome is
differentially transcribed, identifying many more transcripts than
just those encoding 20,000 proteins and their splice variants
(Chapter 13).

i Most such mutations are recessive, meaning that two damaged
copies are required for the disorder to manifest, and reciprocally
that there may be a high frequency of heterozygous carriers,
especially for mutations that may, like cystic fibrosis (1 in 25
carrier frequency in Caucasian populations) and sickle cell anemia
(common in tropical and subtropical regions), have provided
protection in the heterozygous state against tuberculosis and
malaria, respectively, a positive evolutionary trade-off.

These diseases are the result of “catastrophic component
damage”, i.e., disruptive mutations (mainly) in protein- 
coding sequences, which are generally lethal or severely 
disabling in the homozygous state, and many deleterious 
in the heterozygous state. 

However, controlled breeding is of course not possible in
humans, so conventional genetic maps could not be
constructed to locate and identify damaged genes and, in any
case, the number of known genes with trackable allelic
variants that segregated in large families was limited. A
different approach was needed.

The solution, proposed by Ellen Solomon and Walter
Bodmer in 1979 and again by David Botstein, Ray
White, Mark Skolnick and Ron Davis in 1980, was
to take advantage of single nucleotide polymorphisms
(‘SNPs’) in genomes that resulted in gain or loss of
restriction endonuclease sites and a resulting change in
the size of the corresponding fragments. Such restriction
fragment length polymorphisms, or RFLPs, could be
tracked as surrogate genetic markers by hybridization of
Southern blots with cloned sequence probes, and linked
to the inheritance of a condition in extended families.
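
The principle behind an RFLP can be illustrated with a minimal sketch. It is not the laboratory protocol, only a toy model: the two allele sequences and the `digest` helper are hypothetical, while the recognition site GAATTC (EcoRI, cutting after the first G) is used as a familiar example. A single-base change that abolishes one site merges two fragments into one, which is the size difference a Southern blot would reveal.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Return the fragment lengths produced by cutting seq at every site.

    cut_offset is the position of the cut within the recognition site
    (1 means the enzyme cuts after the first base, as EcoRI does).
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(len(seq) - start)  # trailing fragment
    return fragments


# Hypothetical 84-base region carrying two restriction sites.
allele_a = "ATGC" * 5 + "GAATTC" + "TTAG" * 10 + "GAATTC" + "CCGA" * 3
# A single A->G substitution destroys the first site in the variant allele.
allele_b = allele_a.replace("GAATTC", "GAGTTC", 1)

fragments_a = digest(allele_a)  # three fragments
fragments_b = digest(allele_b)  # two fragments, one of them longer
```

In this toy case the first allele yields three fragments and the variant allele two, so a probe hybridizing to the region would light up bands of different sizes in carriers of each allele.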

The search was assisted by the construction of
chromosome-specific cloned libraries by flow cytometry
sorting of metaphase chromosomes in the early 1980s
by Kay Davies, Bryan Young, Rob Krumlauf and
colleagues, by the use of somatic cell hybrids developed by
Bodmer and colleagues and Stephen Goss and Henry Harris
in the 1970s, and ‘radiation hybrid’ mapping developed
in 1990 by David Cox, Richard Myers and colleagues,
whereby individual human chromosomes or parts thereof
can be separated and maintained in mouse cell lines.

These approaches were made feasible by high pen- 
etrance disorders, which are easy to trace in affected 
pedigrees, especially if dominant or located on the 
X-chromosome, i.e., commonly exposed in males. On the 
other hand, the difficulty in identifying the causative gene 
was increased by the relatively large genomic regions iden- 
tified by genetic mapping, a needle in a haystack problem. 

One of the complications was that meiotic recombination
across the human genome (and indeed across
mammalian genomes in general) is not uniform, but
rather occurs at hotspots, between which there is little
recombinational exchange, referred to as ‘linkage
disequilibrium’. The recombination-poor regions between
hotspots are termed ‘haplotype blocks’, which parse


j Sequence polymorphisms in intronic sequences that altered
restriction sites were later patented as a means of diagnosing
tightly linked genetic disorders, a surprising decision by patent
offices in view of the long history of linkage mapping.

k Human genetic data suggest that 60% of recombination events
happen in 6% of the genome.

the genome into ‘HapMaps’ that subsequently formed
the analytical platform for population-scale mapping of
genetic variations influencing complex traits and
multifactorial diseases (see below).

The identification of damaged genes by cloning and 
mapping approaches was, at the time, a tour-de-force, 
achieved by high-resolution mapping of the chromo- 
somes carrying affected genes, and searching for mark- 
ers (i.e., sequence variants) that are co-inherited with the 
condition, aided by homozygosity mapping, since most
damaged genes are recessive. Coarse mapping to haplotype
blocks was relatively easy, but fine mapping to
locate the affected gene within the region, especially in 
the absence of obvious candidates, relied on rare recom- 
binational or deletion events in particular families, which 
were hard to find. 

Eventually the hard grind paid off, culminating
in the identification in 1986 by Tony Monaco, Lou
Kunkel and colleagues of the protein-coding gene on
the X-chromosome that is damaged in Duchenne's
Muscular Dystrophy (dystrophin). Dystrophin is
required for the maintenance of muscle integrity and,
as noted previously, is one of the largest genes and
proteins in vertebrates. In 1989, Lap-Chee Tsui and
colleagues identified the gene responsible for cystic fibrosis
on chromosome 7, which encodes a chloride ion
transporter (‘Cystic Fibrosis Transmembrane Conductance
Regulator’ or ‘CFTR’) and explained the symptoms of the disease,


l Human dystrophin is composed of 79 exons (encoding 3,684 amino
acids) that account for 0.6% of its 2.4 Mb sequence.

including salty sweat, in both cases providing targets for
diagnosis and gene therapy.

In 1991, a CGG trinucleotide repeat expansion was
identified in the 5′UTR of the FMR1 gene (which encodes
a synaptic protein) in Fragile X Syndrome, the most
common form of inherited intellectual disability (which
is also associated with autism). Similar repeat
expansions were subsequently identified in other genes causing
X-linked or autosomal dominant neurological
disorders such as Kennedy's Disease, Myotonic Dystrophy,
Huntington's Disease and Spinocerebellar Ataxia,
which were initially thought to result in defective proteins
or translation (since many lie in the introns or UTRs) but
may also be RNA toxicity disorders, an increasingly
prominent theme in neurodegenerative diseases (Chapter
16) (Figure 11.4).

Others followed, and it became easier as the technol- 
ogy improved. 


A PLETHORA OF ‘RARE DISEASES’


About 3%-5% of all children are born with a serious
physical or intellectual disability due to a mutation
in a protein-coding gene or a chromosomal
abnormality, which are also major causes of
miscarriage. While some genetic disorders, like cystic
fibrosis and thalassemia, are relatively common,
most are individually rare, due to damage to any one
of thousands of protein-coding genes. Collectively,
however, they account for a high proportion of all
infant deaths and pediatric hospital admissions, as
well as a lifetime burden on survivors, their families
and health systems.m

FIGURE 11.4 Schematic of gene showing repeat expansions that cause neurologic diseases. The differing sizes of the associated
triangles roughly reflect the range of repeat expansion sizes in each disease. The SCA12* repeat is in an intron. (Reproduced from
Paulson with permission of Elsevier.)

Such damaged genes, because of their low allele
frequency in the population and mostly recessive nature,
often lie silent in family histories, as the incidence of
homozygosity is low.n The high collective frequency of
defective alleles, however, means that at least 1 in
10 couples are at serious risk of bearing a disabled child
with every pregnancy, due to the 1 in 4 chance of each
transmitting to their child a damaged gene that they
unknowingly have in common. There is also a
surprisingly high frequency of new (‘de novo’) mutations that
result in intellectual disability.
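
The arithmetic behind such recessive risks can be worked through for a single disorder. The sketch below uses the 1 in 25 carrier frequency quoted earlier for cystic fibrosis and assumes random mating and simple Mendelian inheritance; the function name is hypothetical, and real populations depart from these assumptions (for example through consanguinity).

```python
from fractions import Fraction


def affected_birth_risk(carrier_freq):
    """Chance that a random couple has an affected child for one recessive disorder.

    Assumes random mating: both partners must be carriers, and each
    pregnancy of a carrier couple has a 1 in 4 chance of homozygosity.
    """
    both_carriers = carrier_freq * carrier_freq
    return both_carriers * Fraction(1, 4)


cf_carrier = Fraction(1, 25)                   # carrier frequency from the text
couple_risk = cf_carrier * cf_carrier          # chance that both partners are carriers
child_risk = affected_birth_risk(cf_carrier)   # chance per pregnancy in a random couple
```

Under these assumptions about 1 couple in 625 shares a cystic fibrosis mutation and about 1 birth in 2,500 is affected. Each individual disorder is rare, but the shared-carrier risk accumulates when the same calculation is repeated across thousands of genes.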

The identification of damaged genes in individuals
suffering severe disabilities is now done not by
genetic mapping (impossible due to their rarity) but by
whole ‘exome’ or genome sequencing, comparing their
genome (and usually those of their parents, called a
“trio”) with a reference, to identify mainly variations
in protein-coding sequences (or proximal non-coding
regions, such as splicing signals) that introduce a frame
shift or stop codon resulting in a truncated protein,
or codon changes that disrupt protein structure and
function.o
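
The logic of a trio comparison can be sketched in a few lines. This is a deliberately simplified illustration, not a clinical pipeline: genotypes are encoded as the number of non-reference alleles (0, 1 or 2), and the variant names and the `classify_variant` helper are hypothetical. Real analyses must also handle compound heterozygotes, X-linked inheritance and sequencing error.

```python
def classify_variant(child, mother, father):
    """Classify one variant from trio genotypes (counts of the non-reference allele)."""
    if child > 0 and mother == 0 and father == 0:
        return "de novo"               # present in the child, absent in both parents
    if child == 2 and mother == 1 and father == 1:
        return "recessive homozygous"  # one copy inherited from each carrier parent
    if child > 0:
        return "inherited"
    return "absent in child"


# Hypothetical genotypes: (child, mother, father)
variants = {
    "var_A": (1, 0, 0),
    "var_B": (2, 1, 1),
    "var_C": (1, 1, 0),
    "var_D": (0, 1, 0),
}
calls = {name: classify_variant(*gt) for name, gt in variants.items()}
```

Variants flagged as de novo or recessive homozygous would then be prioritized by their predicted effect on the protein, which is where the frame shift, stop codon and codon-change criteria come in.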

Exome sequencing has been favored by many clinical
geneticists and others because of the emphasis on
protein-coding mutations and because it is cheaper and
easier to analyze than whole genome sequencing. Its
diagnostic yield is, however, lower than whole genome
sequencing because it is limited to annotated exons in
annotated genes, has technical biases, and is generally
unable to detect other types of damage such as
translocations and copy number variations, which can be found
using other means.

Nonetheless the process is becoming increasingly effi- 
cient, supported by databases that have cataloged thou- 
sands of genetic disorders, and sophisticated software 


m Protein-coding mutations, whose presence may not be evident in
early life, account for up to 50% of pediatric hospital conditions and
10%-20% of all hospital admissions, as well as morbidity and
premature death in later life. Examples of such (damaged) genes are
those causing familial hypercholesterolemia and cardiac defects,
which result in catastrophic heart failure in otherwise healthy adults.
Many adults carry protein-coding mutations that have yet to become
pathogenic.

n Higher in communities that have consanguineous, i.e., mostly first
cousin, marriages, which increases the odds of homozygosity of
mutated genes in the children.

o There are many more possible amino acid changes (allelic variants)
with more subtle effects.

p Exome sequencing is accomplished by oligonucleotide-based
hybridization capture of known protein-coding sequences, thereby
removing ~99% of the genome prior to sequencing.

that can sift through millions of individual sequence
variations. It is also being aided by increasing detection
of ‘expressed regions’ in transcriptome studies,
leading to better annotation of both coding and non-coding
genes.

Currently, several genetic conditions are polled at birth 
by the so-called “Guthrie” heel prick blood test, which 
uses biochemical and genetic tests to screen for genetic 
disorders that can be treated by early intervention. The 
prototype, and good example, is the test for phenylke- 
tonuria, a rare recessive disorder whereby infants can- 
not metabolize the aromatic amino acid phenylalanine, 
leading to mental retardation, which can be avoided by 
dietary modification. In the near future, it is likely that
the Guthrie test will be replaced, conditional on paren- 
tal consent, with whole genome sequencing, which will 
provide a much more comprehensive view of incipient 
genetic problems and allow early intervention to prevent 
or mitigate their effects. 

Moreover, the surprise finding that a significant
proportion (~6%) of the DNA circulating in the blood of
pregnant women comes from the fetus has led to the
rapid rise of non-invasive prenatal testing (NIPT), which
can detect chromosomal trisomies (such as trisomy 21,
Down’s Syndrome) more accurately and with no threat
to the embryo (unlike the preexisting amniocentesis and
chorionic villi sampling tests). This has also
contributed to the progressive demise of medical cytogeneticists,
whose other main activity is detecting chromosomal
translocations in cancer and balanced translocations in
reproductive failure, also soon to be replaced by genome
sequencing.

The identification of mutations causing serious disor- 
ders will inevitably become fully automated, as comput- 
ers match the spectrum of patient sequence variation and 
clinical features with recorded cases, reducing and even- 
tually obviating the need for ad hoc sleuthing by clinical 
geneticists in hospital laboratories. Computerization
will also allow such information, and recommended 
evidence-based actions based on the latest publications 
and national guidelines, to be delivered to the desktop 
of health professionals, including general practitioners on 
the front line. 


q Whole genome sequencing can also allow detection of mutations in
non-coding regulatory RNAs, such as those that occur in some cases
of phenylketonuria (see Chapter 13).


COMPLEX TRAITS AND DISORDERS


Human genetic analyses have progressively moved from
protein-centric (such as the classic blood-group and HLA
allele frequencies) to variable microsatellite loci and
more recently to genome-wide approaches involving very
large numbers of individuals. Human genomes vary
by ~0.1%, i.e., unrelated individuals have 4-5 million
sequence differences, although the total number of
differences that occur among humans is many times greater,
with no absolute differences yet found between
populations, although allele frequencies vary. Studies
comparing the congruence or difference between identical
and non-identical twins showed that genetic factors play
a substantial part in susceptibility to almost all human
traits and disorders, including to infectious diseases.

However, the identification of genetic loci
contributing to complex traits and diseases is not amenable to the
approaches used in mapping severe genetic deficiencies
because each causal locus often only makes a small
contribution to overall heritability. The problem was to a
significant extent solved by the development of haplotype
maps and oligonucleotide arrays (originally developed by
Patrick Brown and colleagues for transcriptome
analysis) that poll common sequence variants. SNP arrays
are cheap to produce and enabled large population-scale
surveys, termed ‘genome-wide association
studies’ (GWAS), which compare the distribution of sentinel
SNPs (and by imputation other variants that co-segregate
in the same haplotype block) to identify variants that are
statistically over- or under-represented with respect to the
trait or disease under study. The statistical probabilities
are then graphed across the genome to produce so-called
‘Manhattan’ plots, with a P-value of ~10⁻⁸ commonly
used as a significance threshold.
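
The statistical core of a GWAS can be illustrated for a single SNP. The sketch below is a simplification under stated assumptions: it applies a plain 2 × 2 allelic chi-square test to hypothetical allele counts, whereas published studies typically use regression models that adjust for ancestry and other covariates. The value 5 × 10⁻⁸ is the conventional genome-wide significance threshold, and the function name is illustrative only.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional genome-wide significance level


def allelic_association(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the alternate allele is enriched in cases.
chi2, p = allelic_association(case_alt=600, case_ref=400, ctrl_alt=400, ctrl_ref=600)
manhattan_height = -math.log10(p)  # the value plotted on the y-axis
is_significant = p < GENOME_WIDE_THRESHOLD

# A SNP with identical allele frequencies in cases and controls shows no signal.
null_chi2, null_p = allelic_association(500, 500, 500, 500)
```

Repeating the test for every SNP and plotting the negative logarithm of each P-value against genomic position gives the skyline that lends the Manhattan plot its name. The stringent threshold reflects the very large number of tests performed.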

The first GWAS was conducted in 2002 on 94
Japanese individuals who had suffered myocardial
infarction and 658 controls using (protein-coding)
gene-centric SNPs, identifying, among others, an intronic SNP
that enhanced the transcriptional level of the
lymphotoxin-alpha gene, confirmed in a more focused analysis
of over 1,000 affected individuals and controls. This
was followed by a study in 2005 on 96 individuals
suffering macular degeneration with 50 controls, which
identified two significantly associated SNPs in an intron of
a gene encoding a blood complement factor. Two
years later the Wellcome Trust Case Control Consortium
published a multilateral GWAS involving 14,000 cases of
seven common diseases (~2,000 individuals for each of
coronary heart disease, type 1 diabetes, type 2 diabetes,
rheumatoid arthritis, Crohn's disease, bipolar disorder
and hypertension) with 3,000 shared controls. Other
studies at the time led to the discovery of many
variants in non-coding regions regulating the developmental
expression of human fetal hemoglobin.

Since then, study sizes have grown to millions of 
individuals and have encompassed over 1,000 different 
conditions and traits, usually using ‘biobank’ samples 
integrated by international consortia.

Despite the still-present difficulties in teasing out the
interplay of genetic variations and heterogeneous
social-environmental factors in complex traits, the phenotypes
examined include psychological traits such as
temperament, neuropsychiatric disorders (such as autism,
schizophrenia and bipolar disorder, ADHD (attention
deficit hyperactivity disorder), panic disorder and
depression), vertigo, neurodegenerative diseases
such as Alzheimer’s and Parkinson's Disease, as
well as various types of cancer, immunological
disorders (such as ankylosing spondylitis, atopic dermatitis,
asthma and inflammatory bowel disease),
hypertension, height and body mass index, bone density
and osteoporosis, alcoholism and other drug
dependences, caffeine consumption, handedness,
insomnia, aging, and even cognitive performance,
intelligence and correlated (and environmentally
contingent) educational attainment (Figure 11.5).

Two general findings emerged from this fleet of 
GWAS, apart from the identification of tens of thousands 
of SNPs/haplotype blocks associated with various condi- 
tions and traits. 

The first is that GWAS does not appear to identify,
quantitatively, all of the genetic contribution to complex
traits, traditionally determined by pedigree estimates and
twin studies, although these are limited by confounding
environmental and methodological factors.
The emblematic example is height, a deceptively simple
trait that is known to be highly (80%-90%) genetically
determined after controlling for environmental
variables such as nutrition, implicit in twin studies, but
where only ~25% of the variance could be accounted for by
GWAS-identified loci, of which there are at least 180.
More extensive studies identified several thousand
“near-independent” DNA markers and rare variants that appear
to account for 60%-70% of the genetic contributions to
height, possibly overestimated due to uncorrected
stratification. Another confounding factor is ‘hidden
epistasis’ (i.e., synergistic interactions between loci
involving regulatory networks). There is similar
complexity of polygenic contributions to other traits such
as urate, insulin-like growth factor 1 and testosterone
levels.

FIGURE 11.5 Combined Manhattan plot of two large genome-wide association studies of education and intelligence.
(Reproduced from Hill et al. under Creative Commons Attribution 4.0 International License.) The red line indicates the threshold
for genome-wide significance and the black line the threshold for suggestive associations. The data suggest that genes involved in
neurogenesis, myelination, expressed in the synapse and involved in the regulation of the nervous system play a role in the
variation in intelligence.


In addition to the plethora of environmental factors 
and life histories that can interact with genotypes differ- 
ently, haplotype analysis masks ‘private’ mutations and 
tandem repeat variations that have occurred in individual 
lineages since the divergence of the common versions of 
haplotype blocks, which occurred hundreds of generations
ago. For example, whole genome sequencing of
families found that


Variation in height in our sample arises from a 
combination of a small number of QTLs with
large effects - which are not tagging previously 
identified common variants, and so cannot be 
imputed from them - and a large number of com- 
mon variants with small effects.206 


Tandem repeat variations make a significant
contribution to autism, as also do rare mutations that are
only a few generations old. There is also an unknown
contribution of transgenerational epigenetic inheritance
(Chapter 17), which cannot be polled by DNA variants.

The second general finding from GWAS is that
(unsurprisingly) the vast majority of genetic variations
associated with complex traits and diseases, including
cancer predisposition, occur outside of protein-coding
sequences, in intronic and intergenic sequences.


r QTL = Quantitative trait loci.


Although some pleiotropic SNPs (affecting multiple
traits) occur in coding sequences, UTRs and promoters,
loci containing multiple-trait associated variants cover
the majority of the genome. A high proportion of the
imputed loci exhibit the signatures of being (real) genes,
including promoters characterized by DNase
hypersensitivity, typical chromatin modification signatures and
transcription. Variation between individuals
also occurs by recombination between endogenous
proviral sequences, as well as in tandem repeat sequences.

There is enrichment of variations in enhancers
(Chapters 14 and 16) that are active in disease-relevant
cell types associated with developmental abnormalities,
cancers, Alzheimer's Disease, schizophrenia,
autoimmune diseases including diabetes, rheumatoid arthritis
and multiple sclerosis, and cardiovascular disorders.
Functional analyses are uncovering increasing numbers
of causal variations, many linked to non-coding RNAs
transcribed from these loci. Indeed, while most
haplotype blocks identified as being associated with
complex traits and diseases in GWAS are devoid of
protein-coding genes (‘gene deserts’), most
produce multi-exonic non-protein-coding RNAs,
at least some of which comprise or are candidates
for the molecular basis of trait association
(Chapter 13).

Identifying the relevant variations within haplotype
blocks among the many differences between individuals
is a huge challenge but will likely be achieved by
analysis of large datasets of genome sequences, as recently in
the case of autism. These analyses will be informed
by model organism studies, RNA expression and
predicted structural variants, DNA and histone
modifications in affected tissues (epigenome-wide association
scans/studies or EWAS) and transcription factor binding
profiles, to link genomic variants with molecular and
phenotypic indices.


THE TRANSFORMATION OF MEDICAL 
RESEARCH AND HEALTHCARE 


Frustration has often been expressed at the delay in the
delivery of health benefits from the HGP, much of it based on
promises and expectations that arguably had their roots
in a century-old ‘genes for’ mentality, which was boosted
by successes in the study of monogenic disorders but
does not reflect the complexity of most human diseases
and traits. Nonetheless, by 2011 it was estimated that
there had been a 140-fold economic return on investment
by the US government in the HGP, and, fueled by the
major scientific advances that the genome sequences have
made possible, there is renewed interest in harnessing the
information in whole genome sequences for healthcare at
individual and population scales.

The $1,000 human genome sequence cost barrier was 
breached in 2014 and is likely to decline further. New 
competing approaches and technologies are emerging 
and reaching the market, such as long-read sequenc- 
ing and the combination of chromosome conformation 
capture and deep sequencing for chromosome-length 
assembly of large genomes. Other technologies using
solid-state devices and high-resolution microscopy may 
not be far away, with the $100 human genome sequence 
in sight. 

The declining cost of sequencing prompted the
establishment of projects to explore human genetic diversity
and the etiology of cancer, beginning with the 1,000
Genomes and the International Cancer Genome
Consortium projects, followed quickly by the UK
100,000 Genomes Project, the first to apply
population-scale genomic sequencing to the diagnosis of genetic
disorders and cancer, completed in 2018. Larger
projects are underway, with the UK announcing a minimum
of 1 million genomes (and an “ambition” of 5 million
genomes) to be sequenced by the UK Biobank and the
National Health Service and 1 million genomes to be
sequenced by the US ‘All of Us’ program, along with
accompanying clinical and lifestyle data, with similar
projects under way in China and many other places. In
fact, with the accumulation of such studies (with
high-coverage WGS of very large numbers of individuals with
diverse ancestral and admixed backgrounds, deep
phenotyping, longitudinal assessment and improved imputation
methods), the focus is shifting to the dissection of the
contribution of rare non-coding variants to human
phenotypic variation, including polygenic risk
variant calling for complex diseases (an approach that
is still controversial) and pharmacogenomic indices to
guide drug selection and dose.

Sequencing of tumor DNA is also revolutionizing
the understanding and treatment of cancer, showing
that cancers that arise in different tissues are caused by
a similar spectrum of mutations. Most of the main
“driver” mutations occur in protein-coding genes, such
as TP53, whereas there are many other non-coding
variants that also contribute, including previously undetected
“weak drivers” with aggregated effects on cancer
phenotypes. Increasing numbers of the protein
mutations can be treated with targeted drugs and, in the
case of tumors with high mutational load, with
immunotherapies, which are proving extraordinarily successful in
increasing survival.

Genetic screening is allowing the identification of
cancer-susceptibility genes and monogenic diseases at
population level, finding a large number of previously
unsuspected carriers and individuals with latent disease
risk. Within a decade or so it is likely,
depending on the pace of reduction in sequencing and
storage/analysis costs, that genomic analysis will become
routine in the early detection, identification, treatment and
prevention or mitigation of genetically linked disorders
and risks, including those usually manifested later in life,
such as familial hypercholesterolemia, arthritis and
cancer. Although the evidence framework and underlying
databases are still evolving, identification of underlying
mutations is already leading to improvements in
outcomes, through either the selection of targeted drugs or
the prediction of the likely response to immunotherapies,
leading to substantial increases in life expectancy and,
sometimes, to permanent remission.


s High-throughput genetic screening of C. elegans orthologs of human
obesity-candidate genes reported in GWAS identified 17
protein-coding loci that are causally linked to obesity across phylogeny.


t Some non-coding mutations in regulatory regions, most prominently
in the promoter of the TERT telomerase component,
also have driver malignant effects.


The Human Genome 


Twenty years after the publication of the human
genome drafts, there is virtually complete coverage of
entire chromosomes and a full picture of the
diversity of human genomes (as ‘the genome’ is a misnomer) is
emerging, including the secrets of the highly repetitive
and heterochromatic regions and the functional impact of
their variations.

The acquisition of human whole genome sequences 
at scale will continue to illuminate human biology and 
transform medical research, drug discovery and health- 
care over the coming decades. Millions of genomes, 
accompanied by billions of data points from clinical 
records, self-phenotyping and smart sensors (which 
record real-time physiological and environmental 
parameters), will create a multidimensional information 
ecology that can be mined for new genotype-phenotype 
correlations using machine learning and other methods 
of artificial intelligence, consequently refining patient 
stratification and treatment.u Once the infrastructure


u Machine learning on transcriptome and genomic data in relation to
cell differentiation stage will also transform understanding of the
genes and genetic variants controlling development.

is in place to analyze and report the consequences of
genomic variants to clinicians (and patients), medicine 
will change from the art of crisis management to the 
science of good health, and radically improve the qual- 
ity, efficiency and sustainability of healthcare, arguably 
the most important and fastest growing industry in the 
world.

At the same time as the genome projects were getting
into full swing, another revolution was underway, which 
brought into light the natural capacity of RNA to bind 
other RNAs and DNA sequence-specifically to combat 
viruses and to regulate gene expression, as first proposed
decades earlier (Chapters 3, 5 and 9).

Such logical capacity expands enormously the range 
of regulatory options by using RNA to address different 
targets and execute specific actions using a generic infra- 
structure of guidable protein effectors. This is a highly 
efficient and flexible system of gene control, which has 
opened up new avenues for gene modulation and genetic 
engineering. It also lies at the heart of the epigenetic con- 
trol of gene expression during multicellular differentia- 
tion and development (Chapter 16). 
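
The addressing logic described above can be caricatured in code. The sketch below is a toy model resting on strong assumptions: targeting is reduced to perfect Watson-Crick complementarity, the transcript and guide sequences are invented, and the function names are hypothetical. It is meant only to show how one generic search routine (standing in for the protein effector) is redirected to different targets simply by changing the guide RNA.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_targets(guide, transcripts):
    """Map transcript name -> positions of sites perfectly complementary to guide."""
    site = reverse_complement(guide)
    hits = {}
    for name, seq in transcripts.items():
        positions = []
        pos = seq.find(site)
        while pos != -1:
            positions.append(pos)
            pos = seq.find(site, pos + 1)
        if positions:
            hits[name] = positions
    return hits


# Invented transcripts and guides, for illustration only.
transcripts = {
    "mRNA_1": "AUGGCUACGUUAGCCGAUAA",
    "mRNA_2": "AUGCCCGGGAAAUUUCCCUGA",
}
guide_1 = reverse_complement("ACGUUAGC")  # pairs with a site in mRNA_1
guide_2 = reverse_complement("GGGAAAUU")  # pairs with a site in mRNA_2

hits_1 = find_targets(guide_1, transcripts)
hits_2 = find_targets(guide_2, transcripts)
```

The search code is identical in both calls and only the guide differs, which is the economy of the system: a new target requires a new short RNA rather than a new protein.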


UNUSUAL GENETIC PHENOMENA 
INVOLVING RNA 


The demonstration that RNAs can use base pairing to
form double-stranded structures dates back to 1956. In
the 1960s and 1970s, numerous reports showed the
existence of double-stranded RNAs (dsRNAs) derived from
viral RNA genomes replicating in eukaryotic cells. It
had also been known since the 1960s that synthetic and
naturally occurring viral, bacterial and fungal dsRNAs
can induce antiviral cytokines called ‘interferons’ and
inhibit protein synthesis in animal cells at low
concentrations, an important feature of innate immunity. The
first candidate RNA drug (reported in 1967) was a fungal
extract that contained “an anti-viral agent ... [which] is
an RNA of viral origin”.

The discovery of interferons and their induction by
dsRNAs explained, at least in part, a phenomenon known
since the 1920s as “virus interference”, whereby viral
infection results in resistance to subsequent infections
by the same or related viruses. The mechanism was
subsequently shown to involve two interferon-induced,
dsRNA-dependent enzymes that inhibit translation and
activate an endoribonuclease. In 1985, Masayori
Inouye and colleagues developed an antisense platform
(‘micRNA’) to regulate host gene expression, which has
“great potential as a novel cellular immune system for
blocking bacteriophage or virus infection”, with Sergio
Giunta and Giuseppe Groppa suggesting that


Upon virus infection and exposure to viral nucleic 
acids, the cell synthesizes micRNAs that ... con- 
stitute some of the dsRNAs that are involved both 
in interferon induction and in the cellular meta- 
bolic pathways activated by interferon ... and may 
also be preferentially cleaved by cellular enzymes 
which would account for the still unexplained dis- 
crimination between cleavage of cellular and viral 
mRNAs in interferon treated cells.” 


presaging both the mechanism and the sensing of foreign
nucleic acids by the Toll pathway.b

In the late 1980s, Andrew Fire, Craig Mello and others
showed that the introduction of DNA constructs containing
fragments of genes in C. elegans caused suppression of the
endogenous homologous gene. As expected, this
silencing was observed when constructs were designed to drive
antisense RNA transcription. Surprisingly, however,
silencing was also observed when the constructs were designed
to drive ‘sense’ gene expression, which was expected to do
the opposite, i.e., increase the mRNA levels. Adding to this
puzzle, in 1995, Su Guo and Ken Kemphues obtained the same
result by directly introducing in vitro transcribed RNAs in either
the sense or antisense orientation, which caused ‘silencing’
of the corresponding gene in C. elegans.

Similar silencing effects in plants and fungi were also 
reported in the 1990s, when it was established that the 
presence of multiple copies of introduced transgenes 
resulted in repression not only of the transgenes, but also 
of the homologous endogenous genes.* 


a Interferons were first described and named in 1957 by Alick Isaacs 
and Jean Lindenmann, who showed inhibition of influenza virus 
growth by prior exposure of cells to active or heat-inactivated virus.* 
They are now known to comprise three classes of proteins, two of 
which (interferon types I and III) bind to cell surface receptors and 
induce expression of proteins that prevent the virus from replicat- 
ing.>° Interferon gamma (type II) enhances specialized subsets of 
immune responses and is involved in the development of autoim- 
mune disorders such as multiple sclerosis." 


DOI: 10.1201/9781003109242-12 


b Animal cells also discriminate foreign nucleic acids by their lack of 
nucleotide modifications, in DNA by lack of CpG methylation and 
in RNA by lack of base modifications, sensed by the so-called 
Toll-like receptors, which also play a role in development and brain 
synaptic architecture. The modification of synthetic mRNAs to 
circumvent the innate immune response was a critical step in the 
development of mRNA vaccines. 

c Sequence-specific transgene-induced virus resistance had also been 
observed and appeared to be a natural plant defense mechanism.>”.38 




The iconic example is flower color in petunia, where 
the effect of silencing of endogenous loci by antisense 
RNAs was already well established, as famously illus- 
trated a decade earlier by the changes of pigmentation 
in transgenic petunia flowers expressing antisense RNAs 
targeting genes specifying the anthocyanin biogenesis 
pathway, which appeared to result from “increased 
RNA turnover”. 

In 1990, two groups did the opposite experiment 
but unexpectedly obtained the same result — they over- 
expressed a pigment synthesis gene hoping to produce 
deep purple petunia flowers, but instead obtained pre- 
dominantly white flowers, with remarkable mosaic pat- 
terns, including one dubbed "The Cossack Dancer" 
(Figure 12.1).414445 

These phenomena were variously termed ‘co-suppression’, 
‘homology-dependent gene silencing’, ‘transgene silencing’ 
(‘quelling’ in fungi)?46-? or ‘RNA-mediated interference’ 
(RNAi), and were later shown also to occur in Drosophila. 

FIGURE 12.1 Homology-dependent gene silencing. The 
‘Cossack Dancer’ variegated pattern in transgenic petunia 
expressing a chimeric chalcone synthase pigmentation 
gene resulting in co-suppression of the endogenous genes. 
(Reproduced from Napoli et al. with permission of Oxford 
University Press.) 

Despite being initially regarded as a peculiar phenomenon, 
as early as 1986 ‘co-suppression’ began to be used as a 
biotechnological tool to modulate plant phenotypes,?95325 
only a few years after the first transgenic plants had been 
constructed by the introduction of foreign genes. Similar 
to the puzzling RNA interference observed in worms, it was 
suggested that the silencing caused by introduced copies of 
‘sense’ transgenes in plants++5758 might be triggered by 
“unexpected” promoter activity in the antisense 
orientation.5%% Nevertheless, this remained a matter of 
discussion, as the mechanisms behind these phenomena were 
unknown and appeared to be heterogeneous. 

Several models involving RNA-RNA, RNA-DNA 
and DNA-DNA pairing interactions (or RNA-DNA tri- 
plexes*>) were proposed and investigated by the pioneer- 
ing groups, those of Michael Wassenegger, Marjori and 
Antonius Matzke, David Baulcombe, Jan Kooter and 
others.^561-6 In some cases, silencing appeared to be caused 
by epigenetic mechanisms, in particular sequence-specific 
RNA-directed DNA methylation (see below 
and Chapter 14), which was termed ‘transcriptional gene 
silencing’ (TGS). 

Other evidence suggested that antisense RNAs inter- 
fered with translation or directed the degradation of 
the target mRNA via dsRNA formation, ?54^557 termed 
‘post-transcriptional gene silencing’ (PTGS), without a 
supported idea of how ‘sense’ RNAs might result in the 
same outcome, or whether there may be more than one 
RNA-mediated pathway, although there was evidence 
that TGS and PTGS are mechanistically related. 

The finding that the silencing of specific genes was 
able to spread systemically in plants indicated the par- 
ticipation of a diffusible molecule,” likely RNA 
itself,+75464777% with the prediction that there existed a 
“sequence-specific signaling mechanism in plants that 
may have roles in developmental control as well as in 
protection against transposons and viruses”. 

In 1999, Baulcombe and Andrew Hamilton, con- 
sidering the lack of evidence for long antisense RNAs, 
hypothesized that low molecular weight RNAs might 
be involved but had escaped detection due to their small 
size. They showed that short antisense RNAs (estimated 
to be ~25nt) are produced in transgenic or virus-infected 
plants undergoing PTGS when both antisense and target 
mRNAs (cellular or viral) are expressed. They proposed 
that these RNAs acted as guides for PTGS, stressing that 
small RNAs “are long enough to convey sequence speci- 
ficity yet small enough to move through plasmodesmata”, 
and are thus able to act as “the systemic signal and speci- 
ficity determinants of PTGS” (Figure 12.2).? 


THE RNA INTERFERENCE PATHWAY 


In 1998, Andrew Fire, Craig Mello and co-workers 
resolved the sense-antisense conundrum by showing 
(inadvertently, as part of an experimental control) that 
addition of sense and antisense RNAs together potently 
triggered systemic and heritable gene silencing in 
C. elegans, which “was at least two orders of magnitude 
more effective than either single strand alone”.5! Although 
the mechanism of silencing was still unknown, these 
observations explained the RNAi phenomenon, suggesting 
that the previous silencing observed with “sense” 
RNAs was due to contaminating dsRNAs formed during 
in vitro RNA synthesis and antisense transcription. 

FIGURE 12.2 The identification of small sense and antisense RNAs in post-transcriptional gene silencing in plants. (Reproduced 
from Hamilton and Baulcombe with permission of the American Association for the Advancement of Science.) The left panel 
shows that transgenic lines of tomato containing ACO transgenes, in which the endogenous ACO gene (involved in fruit 
ripening) has been silenced, express 25nt small sense and antisense RNAs. The right panel shows the accumulation of a 25nt 
RNA following infection with potato virus X. 

Independently, also in 1998, Peter Waterhouse and 
colleagues showed that gene silencing and transgene- 
mediated virus resistance in plants were induced and tar- 
geted by dsRNAs,* and Elisabetta Ullu and colleagues 
showed that dsRNA induced mRNA degradation in try- 
panosomes.*? It was soon also demonstrated that dsRNAs 
trigger gene silencing in Drosophila*^-*7 and mammalian 
cells.88-% 

These findings were followed by genetic dissection of 
the biochemical pathways and proteins involved. The 
first identified was an RNA-dependent RNA polymerase 
(RdRP) named qde-1 (for quelling defective mutant 1); 
RdRPs had been predicted to participate in RNA silencing 
(as, by definition, they would produce dsRNAs). 
Although RdRPs had been detected in prokaryotes and 
eukaryotes decades before, mostly associated with RNA 
virus replication, RdRP activity had also been described 
in tissues of “apparently healthy plants” whose product 
seemed to be “double-helical RNA”, with other reports 
showing the existence of non-viral RdRPs. In 1999, 
Carlo Cogoni and Giuseppe Macino showed that RdRP 
was involved in gene silencing?^ with others subse- 
quently showing that homologs are present not only in 
fungi, but also in many plants and nematodes,‘ indicating 
a common pathway.101-104 

Many other components of the ‘RNAi machinery’ 
were soon identified and shown to occur in plants, 
invertebrates and vertebrates.??!05 To cut another long story 
short, work from the laboratories of David Bartel, Greg 
Hannon, Thomas Tuschl, Phil Zamore and others showed 
that long dsRNA is recognized by dsRNA-binding proteins 
(of which there are many versions) that interact with 
and activate an enzyme named Dicer to cleave the bound 
RNA into short dsRNA fragments of ~20-23nt in length 
(which can be amplified by RdRP, hence making RNAi 
auto-catalytic),?^.101104.107-10? as had been proposed 
previously, explaining the powerful and relatively stable 
systemic and transgenerational silencing obtained with 
small initial amounts of RNAs. 

d Some animals, including mammals, have lost RdRPs but have 
instead evolved a ‘ping-pong’ amplification system to amplify 
piRNAs (see below).?8-100 

The short dsRNA fragments, called siRNAs (small 
interfering RNAs), are then unwound and one of the two 
strands (the ‘guide strand’, determined by the strength of 
the hydrogen bond interaction at the 5'-ends of the RNA 
duplex, known as the ‘asymmetry rule’) is loaded into an 
effector complex, the ‘RNA-induced silencing complex’ 
(RISC). The guide strand pairs with a complementary 
sequence in an mRNA (or other target RNA molecule) 
and induces its cleavage by an ‘Argonaute’ protein, the 
catalytic component of the RISC.99.107-109.113 
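
The dicing and strand-selection steps described above can be illustrated with a toy model. The Python sketch below is only an illustration under stated assumptions: the function names (`dice`, `end_strength`, `pick_guide`), the fixed 21-nt fragment length, the blunt-ended duplex without overhangs, and the four-base-pair window used to score each 5' end by counting hydrogen bonds are all simplifications introduced here, not details taken from the work cited in the text.

```python
# Toy model of dicing and guide-strand choice (illustrative assumptions only).

# Hydrogen bonds per base pair: Watson-Crick pairs plus the G:U wobble pair.
H_BONDS = {
    ("G", "C"): 3, ("C", "G"): 3,
    ("A", "U"): 2, ("U", "A"): 2,
    ("G", "U"): 2, ("U", "G"): 2,
}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the perfectly complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def dice(sense, length=21):
    """Cut the sense strand of a long dsRNA into consecutive fragments.

    Any remainder shorter than `length` is discarded.
    """
    return [sense[i:i + length]
            for i in range(0, len(sense) - length + 1, length)]


def end_strength(strand, partner, window=4):
    """Count hydrogen bonds in the first `window` pairs at the 5' end of `strand`.

    Both strands are written 5'->3', so base i of `strand` pairs with
    base -1-i of `partner` (blunt duplex, no overhangs).
    """
    return sum(H_BONDS.get((strand[i], partner[-1 - i]), 0)
               for i in range(window))


def pick_guide(sense, antisense, window=4):
    """Pick the strand whose 5' end is held by fewer hydrogen bonds.

    Ties go to the antisense strand (an arbitrary choice made here).
    """
    if end_strength(antisense, sense, window) <= end_strength(sense, antisense, window):
        return antisense
    return sense


if __name__ == "__main__":
    sense = "GGCCAUGCAUGCAUGCAAUUA"  # hypothetical 21-nt sense strand
    antisense = reverse_complement(sense)
    print(pick_guide(sense, antisense))
```

In this made-up duplex the sense strand starts with a GC-rich stretch while the antisense strand starts with an AU-rich one, so the sketch selects the antisense strand as the guide. A real prediction would use measured thermodynamic parameters rather than a simple hydrogen-bond count.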

There is a large family of Argonaute proteins and 
a range of small RNA signaling pathways, especially in 
plants.!!%-118 In eukaryotes, some Argonaute proteins are 
found in the cytoplasm, where they are involved in target 
mRNA degradation, whereas others are nuclear. 
Argonaute orthologs also occur in prokaryotes, where 
they chop foreign plasmids or phage DNA (or transcribed 
sequences from foreign DNA) into small (13-25 nt) 
guide sequences, which then target and cleave 
complementary DNA,!20-12 indicating that RNA-guided 
targeting of genes for viral defense and gene regulation dates 
back to the dawn of biology. This system is similar 
to but distinct from bacterial and archaeal CRISPR-Cas 
systems that also derive guides from foreign DNA 
(see below), although some CRISPR-Cas loci encode 
Argonaute proteins that associate with RNA guides, 
suggesting that the two mechanisms work synergistically in 
some organisms.120,124 

e Mammalian stem cells protect themselves from some RNA viruses 
(including Zika virus and SARS-CoV-2) by expressing an 
alternatively spliced isoform of Dicer, which potentiates antiviral 
RNAi, a process similar to that in insects and worms, which also use 
Dicer-dependent RNA interference in antiviral defense, in contrast to 
mammalian differentiated cells, which generally rely on the 
interferon system.!06 

f So named because the first reported Argonaute mutant (in plants) 
yielded a phenotype that resembled the octopus Argonauta argo. 
Different classes of small RNAs bind to specific members of the 
Argonaute protein family. 

g It is not yet known how the Argonaute proteins distinguish 
invading DNA from host sequences, although one possibility is that it 
involves base modification. 

h The cooption of endogenous dsRNA ‘killer viruses’ by the yeast S. 
cerevisiae to gain competitive advantage over related strains may be 
at least a partial reason for the loss of the RNAi pathway in this and 
some other fungal species.!25,126 

Moreover, the identification of highly conserved pro- 
tein complexes that mediate small regulatory RNA func- 
tions is consistent with principles found in RNA control 
of other processes, such as the actions of snRNAs and 
snoRNAs in eukaryotes (Chapter 8), and sRNAs in bac- 
teria, where Hfq acts as a general cofactor for antisense 
RNAs that rely on short stretches of base pairing?" 1? 
(Chapter 9). 

Genetic studies showed that membrane-bound 
dsRNA-binding proteins are required for systemic 
transmission of RNAi.!% There are two such proteins in 
C. elegans,?*5 and two orthologs in vertebrates, the lat- 
ter of which transport internalized extracellular dsRNAs 
from endocytic compartments into the cytoplasm for 
immune activation.'*°!37 However, while systemic RNAi 
transmission is well understood in plants,'3 the extent of 
and mechanisms for exocrine RNA transport and signaling 
in mammals are unclear; in part it appears to involve 
packaging of small RNAs (including microRNAs — see 
below) into exosomes and other circulating vesicles, !4!-1^4 
reported to alter gene expression in immune and other 
cells, 5-5? and to influence aging,'*! as well as tumor 
invasion and metastasis in cancer, ^-^ although there 
are conflicting reports.!5»56 


TRANSCRIPTIONAL GENE SILENCING: 
RNA-DIRECTED DNA METHYLATION 


Consistent with a nuclear role, there is now consider- 
able evidence that TGS (as distinct from PTGS) imposed 
by dsRNAs involves methylation of cognate sequences 
in the genome, 9.57716? which, although at that time best 
documented in plants, led Wassenegger to predict in 
2000 that “RNA-directed DNA methylation is involved 
in epigenetic gene regulation throughout eukaryotes", 
including mammalian X-chromosome inactivation and 
parental imprinting.'* This prediction was confirmed in 
2004 when Manika Pal-Bhadra, Tatsuo Fukagawa and 
colleagues showed that RNAi is required for heterochro- 
matin formation in Drosophila and vertebrate cells,!€?.165 
and Kevin Morris and colleagues showed that siRNAs 
can induce methylation and TGS of homologous loci in 
mammalian cells,'% since confirmed by many others! 
and shown to also involve Argonaute proteins.!66-165 


i First mooted by Dickson and Robertson in 1976 (Chapter 5) and by 
Stephen Benner in 1988.9 

The groups of Rob Martienssen, Shiv Grewal, Danesh 
Moazed, Robin Allshire, Caroline Dean and others 
showed that RNAi pathways also control many aspects 
of chromosome structure and dynamics.!%-17 They 
demonstrated that transposon-dense heterochromatic 
regions are not transcriptionally inert, but express RNAs 
with key functions, many of which involve targeting of 
nascent transcripts and chromosome-associated RNAs 
by small RNA species.!59/5! These include regulation of 
centromere function,!$2-185 meiotic and mitotic chromo- 
some segregation, repression of meiotic genes by tar- 
geting nearby retrotransposon-derived repeats,!8 other 
aspects of meiotic progression including chromosome 
condensation,'® rectification of copy number imbal- 
ances,!% genome rearrangements!?^/55 and chromosome 
reassembly,!'5%1% plant flowering time?! and DNA dam- 
age responses from fission yeasts to mammals,!6?.192.193 
with interconnections to transcription and splicing.!90.5! 
RNAi pathways are required for transposon silencing, 
thought to protect the genome against uncontrolled trans- 
position of mobile elements,!%-1% but have been adapted 
to regulate developmental processes.!56.199.200 

In plants, RNA-directed DNA methylation induces 
TGS through the production by specialized plant RNA 
polymerases of 24nt RNAs that are loaded into AGO4,??'- 
203 which then recruits DNA methyltransferase and directs 
DNA methylation to regulate plant-pathogen interactions, 
stress responses and development.!*%204207 Similar path- 
ways operate in animals, although less is known about 
them. !67169.196-198.205 The notable common feature is the use 
of embedded transposon-derived sequences as the targets 
of locus-specific regulation by RNAs, which may explain, 
in large part, the widespread presence of such sequences in 
genomes of higher organisms (Chapters 10 and 16). 

The observations in C. elegans and other organisms 
showed that RNA signals also mediate transgenerational 
epigenetic inheritance,5!2092!! requiring, inter alia, a 
nuclear Argonaute protein?? and RNA modification? 
(Chapter 17). 


RESEARCH AND BIOTECHNOLOGY 
APPLICATIONS 


The ability of natural or modified small siRNAs, deliv- 
ered directly or via a vector that produces a ‘small/short 
hairpin RNA’ (shRNA), to inhibit gene expression specif- 
ically and potently in plants and animals has been widely 
adopted to investigate and modulate gene function in vivo 
and in vitro. 

In mammals, siRNA was first used to ‘knock down’ 
protein-coding genes in oocytes and newly fertilized 
zygotes by Florence Wianny and Magdalena Zernicka- 
Goetz, where the phenotypes observed were similar to 
those observed with null mutants of the endogenous 
genes. siRNA-directed gene knockdown was later 
shown to be generally applicable to mammalian cells*? 
and applied to systematically target protein-coding genes 
in culture to identify those involved in various aspects of 
cell biology (e.g..?^), viral replication (e.g.,?^) and drug 
responses or sensitivity (e.g.,216). 

The attraction of RNAi (or any nucleic acid-based 
system) for human therapy is substantial, as it flexibly 
employs a relatively common chemistry to inhibit the 
expression of target genes, notwithstanding the possibil- 
ity of off-target effects, which are progressively being 
addressed.?" The longstanding problem is in vivo deliv- 
ery, especially tissue specificity and transport across the 
blood-brain barrier to treat neurological disorders,?!? as 
dsRNAs are not easily absorbed except by the liver.?!? 
The delivery problem has been solved in part by utilizing 
viruses, notably adeno-associated virus (AAV), chosen 
because of its low immunogenicity, many tissue-specific 
serotypes and low rate of chromosomal integration, as well 
as other approaches and vehicles such as artificial 
exosomes, liposomes, nanoparticles and peptide-conjugates 
that have been engineered to deliver stably expressed 
shRNAs and small and large RNAs 
in vivo.220-225 

There are fewer technical problems in plants, and 
shRNA-based transgenes have been widely used to engi- 
neer or to exogenously deliver, for instance, viral and 
fungal resistance in horticultural crops and ornamental 
plants.226227 

Viral vectors may also be used to deliver normal 
copies of defective genes, although their payload is lim- 
ited by the amount of DNA that can be packaged in the 
virus, and there may be residual immunogenicity and 
genotoxicity problems.?$22522 Nonetheless, AAV vec- 
tors carrying replacement genes or antisense oligonucle- 
otides to rescue or bypass splicing defects have been 
used successfully to treat different conditions, such as 
spinal muscular atrophy, muscular dystrophy and 
retinal dystrophy. 

However, the use of siRNA and shRNA in research 
and gene therapy is being rapidly supplanted by a far 
more efficient and precise genome manipulation system, 
based on RNA targeting of engineered nucleases, called 
CRISPR, discussed later in this chapter. 


MICRORNAs 


Closely related endogenous RNA regulatory pathways that 
regulate differentiation and development were waiting in the 
wings to be discovered. As described in Chapter 9, a 22nt 
small RNA, lin-4, had been described by Victor Ambros 
and colleagues in 1993 to control developmental timing in 
C. elegans. In 2000, a second small (~21-nt) regulatory 
RNA species, let-7, was identified by Gary Ruvkun and 
colleagues, produced from a locus that, like lin-4, controls 
developmental timing by negatively regulating the lin-41 
(protein-coding) gene through partial complementarity to 
target sequences in its 3'UTR.2% The similarity to the lin-4 
system led the authors to conclude that 


“the mechanism by which they [lin-4 and let-7] 
regulate the expression of their target could be 
related” and that they “may constitute elements 
of a cascade of stage-specific regulatory RNAs 
that control the temporal sequence of events in 
C. elegans development”. 


Shortly thereafter, Amy Pasquinelli, Ruvkun and col- 
leagues found that the let-7 sequence, size, and ‘hairpin’ 
structure of its precursor are highly conserved in inver- 
tebrates and vertebrates,2% with the latter having mul- 
tiple paralogs.?7 The timing of let-7 expression during 
development and its target sites in the lin-41 mRNA are 
also conserved.2%6238 These “small temporal RNAs”2% 
became known as ‘microRNAs’ (miRNAs). 

The similarity of size of miRNAs and siRNAs (and 
the small RNAs described in plant PTGS), suggested 
a relationship between them, confirmed when Alla 
Grishok and colleagues found that inactivation of Dicer 
and other components of the RNAi pathway impaired 
miRNA processing and activity. 

After the realization of the possible existence of other 
small RNA “classes”, and despite “lingering skepticism” 
and wariness “of what we imagined was a vast detritus 
of partially degraded RNAs in cells”,?% the development 
of strategies for large-scale isolation and sequencing of 
small RNAs with “probable regulatory roles”,?9724122 as 
well as bioinformatic searches using the available genome 
sequences,” led to the cloning and characterization of large 
numbers of small ~21nt RNAs from a variety of organisms, 
including mammals,?37:241.242,244-24 “deviants no longer”.?°° 

As Ruvkun observed at the time: 


why have more of the miRNAs not been revealed 
by genetic analysis? ...the number of genes in the 
tiny RNA world may turn out to be very large, 
numbering in the hundreds or even thousands in 
each genome. Tiny RNA genes may be the biological 
equivalent of dark matter – all around us 
but almost escaping detection, until first revealed 
by C. elegans genetics.2! 


Subsequent studies showed that miRNAs are also produced 
from processed introns and other non-coding transcripts, 
including RNAs derived from repetitive sequences, that 
form an internally folded dsRNA stem-loop structure with 
an imperfectly paired double-stranded stem.25+25%26 This 
structure is recognized and cleaved from precursor host 
RNAs by a ‘Microprocessor’ complex comprised of the 
RNase III endoribonuclease Drosha and its partner 
dsRNA-binding protein Pasha/DGCR8,2%%2 or generated 
directly from the debranching of lariat intermediates 
formed during intron excision (‘mirtrons’)7'? to yield a 
hairpin-shaped RNA of ~65-70 nucleotides in length. This 
‘pre-miRNA’ is then cut by Dicer in the cytoplasm and 
loaded into the RISC complex, which (after ejection of the 
‘passenger’ strand) targets cognate mRNAs for deadenylation 
of their polyA tails and degradation and/or translational 
inhibition by binding to complementary sequences primarily 
in 3'UTRs (Figure 12.3).91239.274-276 

Thousands of tissue-specific miRNAs?4*277-280 — some 
of which are conserved over long evolutionary distances,??! 
whereas others are more lineage-restricted247-249282 — 
have been identified and shown to regulate a wide range 
of differentiation and cellular processes.?/^277285284 
These include maintenance of pluripotency and stem cell 
states,?77:285 cell proliferation,?8 epithelial to mesenchy- 
mal transition,??? brain morphogenesis,?8% hematopoietic 
differentiation,?*? neuronal asymmetry,?% dendritic spine 
development, muscle development, programmed cell 
death, innate immune responses, metabolism, leaf 
development and flowering time, among many 
others. MicroRNAs also undergo developmentally regulated 
post-transcriptional modifications,?-30 and exhibit 
altered expression in developmental disorders and 
cancers?01-303 (dubbed ‘oncoMiRs’). 

A feature of miRNAs, which generally distinguishes 
them from siRNAs, is that the precursor dsRNA structure 
that is recognized by Drosha does not have perfect 
base complementarity, and either strand may be utilized. 
Importantly, both miRNAs and siRNAs can repress gene 
expression without activating the interferon pathway, 
which is triggered by longer dsRNAs. MicroRNAs 
appear to have evolved before multicellularity, as they 
are also found in unicellular algae, although their 
repertoire expanded greatly during metazoan, vertebrate and 
mammalian evolution.307.308 

j Examples, among many others, include the non-coding RNA BIC 
(B-cell integration cluster) originally identified as a common site 
for proviral insertion in lymphomas?%22% and classic lncRNAs such 
as H19,2%6,257 similar to small nucleolar RNA host genes. At least 
some (like H19) have a dual function as regulatory lncRNAs, such 
as the miR-205 host transcript that regulates pituitary hormone 
production. 

k Despite their ubiquity, miRNAs have rarely shown up in genetic 
screens (see Chapter 13). Indeed, prior to their biochemical 
discovery, only one miRNA locus, a gene called bantam, had been 
picked up in Drosophila, and it was not initially recognized as a 
microRNA. 

FIGURE 12.3 Biogenesis and mechanism of action of the main classes of small regulatory RNA. (Reproduced from Jinek and 
Doudna with permission of Springer Nature.) 

Curiously, it appears that an individual miRNA may 
recognize sites in many mRNAs,??3!! and that many, if 
not most, mRNAs, also contain bona fide recognition sites 
for multiple miRNAs.*!2313 The regulatory logic for this 
reciprocal multiplexing is unclear and, while miRNAs 
appear to locate their targets through an 8-nucleotide 
‘seed’ sequence,??^3? the rules for target recognition are 
still being teased out,?!4?!6 as also is the place of miRNAs 
in decisional hierarchies (Chapter 15)?" 
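
The idea of seed-based target location can be made concrete with a small search routine. The following Python sketch is an illustration only: it assumes, for simplicity, that the ‘seed’ is the first eight nucleotides of the miRNA and that a site is a perfect Watson-Crick match to it; the function names, the parameters `seed_start` and `seed_len`, and the example sequences are hypothetical, and other seed definitions can be tried by changing the parameters.

```python
# Minimal seed-match search (a sketch under the assumptions stated above).

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the perfectly complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_site(mirna, seed_start=0, seed_len=8):
    """Return the mRNA sequence (5'->3') that pairs perfectly with the seed."""
    seed = mirna[seed_start:seed_start + seed_len]
    return reverse_complement(seed)


def find_seed_sites(mirna, utr, seed_start=0, seed_len=8):
    """Return the 0-based start positions of all seed matches in `utr`."""
    site = seed_site(mirna, seed_start, seed_len)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping matches
    return hits


def site_table(mirnas, utrs, **kwargs):
    """Map every (miRNA name, mRNA name) pair to its list of seed matches."""
    return {
        (m_name, u_name): find_seed_sites(m_seq, u_seq, **kwargs)
        for m_name, m_seq in mirnas.items()
        for u_name, u_seq in utrs.items()
    }


if __name__ == "__main__":
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # arbitrary 22-nt example sequence
    utr = "AAAA" + "CUACCUCA" + "GGGG" + "CUACCUCA"  # made-up 3'UTR
    print(find_seed_sites(mirna, utr))
```

The `site_table` helper shows how the many-to-many relationship described above (one miRNA with sites in many mRNAs, and one mRNA with sites for many miRNAs) falls out naturally once several sequences are compared at once.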


Although miRNAs are thought to operate primarily 
on mRNAs in the cytoplasm, they also operate in the 
nucleus to target pre-mRNAs,*!® as well as enhancer 
RNAs?” (to modulate chromatin architecture; Chapters 
14 and 16) and nuclear RNAs such as 7SK.?? They also 
bind to long non-coding RNAs, sometimes thought to 
function as miRNA ‘sponges’ or ‘decoys’,*?!*? although 
they are likely legitimate and common targets in their 
own right. Some miRNAs also trigger the production 
of 21-22nt siRNAs from transposons,?^ referred to as 
‘easiRNAs’ (epigenetically activated small interfering 
RNAs), to control genome dosage response and hybrid- 
ization barriers in plants.?? MicroRNAs also influence 
nuclear DNA methylation, 63? but the intersecting small 
RNA pathways and networks have yet to be understood. 

FIGURE 12.4 Biphasic production of piRNAs and Miwi expression during mouse sperm development. (Adapted from Zheng 
and Wang, under Creative Commons Attribution License.) 

Nonetheless, a sort of consensus has emerged. In 
general, siRNAs are perfectly complementary to their 
targets and promote their degradation, and in plants are 
used for defense against viruses and transposons, as well 
as adaptive responses to environmental stress. siRNAs 
are also produced in animals (in oocytes, embryonic 
and somatic cells, from pseudogenes and TEs and other 
naturally formed dsRNAs), where they regulate 
transposons, genome dosage response and developmental 
processes, likely interrelated.+195,198,265,329,330,333-337 On 
the other hand, miRNAs are imperfectly complementary 
to their targets, usually 3'UTR sequences in mRNAs but 
also other, less well characterized RNAs, and often direct 
translational inhibition, for the regulation of 
developmental processes as described above. 
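
The contrast drawn in this consensus, perfect pairing for siRNA-like action versus imperfect pairing anchored by the seed for miRNA-like action, can be expressed as a simple decision rule. The Python sketch below is a deliberately crude illustration: the function names, the labels it returns, the exclusion of G:U wobble pairs and the choice of the first eight nucleotides as the seed are assumptions made here, and real target recognition follows rules that, as noted above, are still being worked out.

```python
# Toy classifier for small RNA/target pairing (illustrative assumptions only).

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the perfectly complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_mismatches(small_rna, site):
    """Count non-Watson-Crick positions between a small RNA and a target site.

    Both sequences are written 5'->3' and must have equal length, so base i
    of `small_rna` faces base -1-i of `site`.
    """
    if len(small_rna) != len(site):
        raise ValueError("small RNA and site must have the same length")
    return sum((small_rna[i], site[-1 - i]) not in WATSON_CRICK
               for i in range(len(small_rna)))


def predicted_mode(small_rna, site, seed_len=8):
    """Label the pairing as 'siRNA-like', 'miRNA-like' or 'no match'."""
    if count_mismatches(small_rna, site) == 0:
        return "siRNA-like"  # perfect complementarity: target cleavage
    # The seed (5' end of the small RNA) faces the 3' end of the site.
    if count_mismatches(small_rna[:seed_len], site[-seed_len:]) == 0:
        return "miRNA-like"  # seed pairs, remainder imperfect: repression
    return "no match"


if __name__ == "__main__":
    small = "UGAGGUAGUAGGUUGUAUAGUU"  # arbitrary 22-nt example sequence
    perfect = reverse_complement(small)
    imperfect = "GGGG" + perfect[4:]  # disrupt pairing with the 3' end only
    print(predicted_mode(small, perfect), predicted_mode(small, imperfect))
```

The rule checks full complementarity first and only then falls back to the seed, mirroring the order in which the two outcomes are described in the text.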


PIWI-ASSOCIATED RNAs 


In 2006, a related general class of small RNAs was 
described, although they had been detected a few 
years earlier by Alexei Aravin and colleagues in analyses 
of tandem repeat and retrotransposon silencing 
and small RNA sequencing datasets in the testis. A 
subgroup of Argonaute proteins, known as the ‘Piwi 
family’, was known to be required for germ- and stem-cell 
development in animals, three homologs (Mili, 
Miwi and Miwi2) being essential for spermatogenesis 
in mammals*%55% (Figure 12.4). These proteins bind 
‘Piwi-interacting RNAs’, or piRNAs, which are longer 
than miRNAs, generally ranging from 25 to 33nt, and 
have a characteristic 2'-O-methylation of the 3' terminal 
ribose,?^5 which is recognized by the PAZ domain of Piwi 
proteins,*%5% with those interacting with different Miwi/ 
Mili proteins having a different mean length.?46551352 

l Processed from duplexes between antisense pseudogene 
(non-coding) RNAs and corresponding mRNAs. 

m The name given to the original Drosophila mutant: 
P-element-induced wimpy testis (Piwi).5% 

Thousands of genomic loci have been found to produce 
piRNAs.? These RNAs are present in millions of copies in 
testes and are so abundant that, like the discovery of small 
nuclear RNAs in the 1960s, they were discovered by visu- 
alization in gel electrophoretic displays. PiRNAs are also 
expressed in ovaries*%35 and somatic cells,?^9? where 
they regulate gene expression in stem cells and embry- 
onic development,?^5* and in the brain, where they are 
required for neurogenesis and memory formation?90365 
(Chapter 17). piRNA expression is also dysregulated in 
cancer. piRNAs are not produced by Drosha processing of 
dsRNA intermediates; the best understood mechanism, for 
piRNAs derived from repeat sequences and present in 
diverse organisms, is a ‘ping-pong’ amplification 
system that uses single-stranded RNA templates.98-1% 


n All miRNAs and siRNAs in plants and siRNAs in Drosophila 
are 2’-O-methylated at their 3’ ends, apparently to improve their 
stability.?^ 

o There are over 2,500 miRNA and over 23,000 piRNA candidate 
loci documented in humans, rivaling the number of protein-coding 
genes. 

piRNAs are thought to be (mainly) involved in the 
repression of transposons,? likely their ancestral func- 
tion./26346352367571 They are also involved in parental 
imprinting.*? However, their range is much greater. For 
instance, male-specific chromosome Y-derived long 
non-coding transcripts, named Pirmy and Pirmy-like 
RNAs, which exhibit a large number of splice variants 
in testis, act as templates for piRNAs that regulate auto- 
somal genes.” piRNAs are also derived from other 
sequences, including intergenic transcripts, pseudogenes 
and the 3'UTRs of mRNAs,!?2755 the latter targeting 
the introns of other genes. Along with their roles in 
development and brain function, these observations lead 
to the conclusion that genomes have co-opted the TE- 
derived piRNA system for regulating transposon activity 
and other dimensions of genome regulation.375-% These 
include the modulation of nuclear architecture at target 
loci containing TEs, which was shown in a Drosophila 
ovarian somatic cell line to involve relocation of these 
genomic regions to the nuclear periphery, heterochromatin 
formation and chromatin conformational changes that 
promote transcriptional silencing, thus providing a 
possible mechanism for the proposed role of TEs as 
“functional motifs” for the dynamic regulation of 
chromatin state and nuclear 
architecture380.381 (see Chapters 14 and 16). 

In sperm development in Drosophila and mammals, 
there are two bursts and classes of piRNAs and two cor- 
responding rounds of Piwi protein expression. The first 
class is expressed before the meiotic ‘pachytene’ stage of 
spermatogenesis, is derived from transposon-rich clusters, 
and associates with Mili and Miwi2. The second class is 
expressed during the pachytene stage from thousands of 
genomic loci, is extremely abundant and associates with 
Mili and Miwi.*8?-38*+ Their function is unknown,?85 
although an intriguing possibility is that this dual system 
evolved to enhance evolvability (Chapter 18), with qual- 
ity control, when progeny numbers are small.38 There is 
widespread formation of dsRNAs in the testis, different in 
profile from that observed in somatic cells.387 Moreover, 
the loss of fertility in Piwi mutants lacking piRNAs is not 
due to de-repression of repetitive elements: pachytene piR- 
NAs repress gene expression by directing the cleavage of 
meiotic transcripts required for sperm function.388,389 

It may also be that the phenotypes of the loss of piRNA 
function are mediated by epigenetic silencing of histone 
genes,?? implicating small RNAs in the transmission of 
epigenetic information across generations (Chapter 17). With 
many more piRNA loci yet to be studied, there is much to 
be learned about this large class of small RNAs in animals. 


P Involving the long-described but mysterious, germ-cell-specific 
‘nuage’ organelle,*% which forms a phase-separated domain (Chapter 
16) that concentrates single-stranded DNA but largely excludes 
double-stranded DNA.?66 


OTHER CLASSES OF SMALL RNAs 


The landscape is even more complicated, as most RNAs 
form double-stranded structures that can be and fre- 
quently are processed into small RNAs. 

A variety of small RNA species are produced from 
rRNAs (rRFs) in a developmental stage-specific manner 
in fungi, plants and animals.??!?* Most derive from 
28S rRNA, and have various sizes, some shown to be 
siRNAs (‘phasiRNAs’), miRNAs and piRNAs, as well 
as ‘qiRNAs’?? that function in DNA damage response.? 
The 18S rRNA is also the source of smaller RNAs, 
including miRNAs, as well as a longer species of 130nt 
that may be an miRNA precursor. miRNAs are also pro- 
cessed from the 5.8S rRNA and the 45S rRNA precur- 
sors.9 The piRNA pathway is required to control the 
abundance of rRNA-derived siRNAs and the ribosomal 
RNA pool in C. elegans.?* 

In all branches of life, tRNAs are processed into 
highly abundant smaller species (tRFs for tRNA- 
derived fragments, or 'tsRNAs' for tRNA-derived small 
RNAs),?9?-^9 initially thought to be degradation prod- 
ucts, ^ despite their production being cell-state and 
tissue-specific???405406 and (at least in some cases) Dicer- 
dependent,94?7 possibly playing a role in the global 
regulation of RNA silencing.^* Some tRNA fragments 
function as miRNAs.^9-4! The production of tRFs 
is influenced by methylation of the tRNA by the RNA 
methyltransferases Dnmt2*!? and Nsun2, which may be 
involved in transgenerational inheritance (see below) and 
loss of which causes neurodevelopmental disorders.*!* 
tRNAs and tRNA fragments are also modified and 
regulated by TET2, which oxidizes 5-methylcytosine in 
DNA"! (Chapter 14). tRFs and modifications thereof are 
stress-induced and inhibit translation,45-4^? control ret- 
rotransposons,?9.?! promote cell proliferation,*! and are 
depleted in immortalized cell lines.^96 

tRFs also convey long distance regulatory signals. 
Their levels are increased in Arabidopsis roots upon 
phosphate starvation,*? and rhizobial (symbiotic bacte- 
rial) tRFs modulate host nodulation in legumes via the 
RNAi pathway.*% They are present, along with miRNAs, 
in secreted exosomes from stem cells?4+ and 
activated T-cells.* They are also abundant in semen and 
sperm, 9.26477 and have been found to contribute to 
intergenerational inheritance of acquired metabolic disorders, 
with evidence that this involves their methylation+28-430 
(see Chapter 17). tRFs have also been shown to affect 
the biogenesis of snoRNAs, scaRNAs and snRNAs, 
as well as histone levels and chromatin organization?! 
(Chapter 14). 


q Similar 20-35nt small 'ddRNAs' have been reported elsewhere to 
control the DNA damage response via interaction with damage- 
induced long non-coding RNAs (Chapter 13) and the formation 
of phase-separated domains (Chapter 16) at DNA double strand 
breaks.395-397 

tRNA-like small RNAs are also produced from the 
processing of long non-protein-coding RNAs, such as 
MALAT1,%24 one of the most highly expressed genes in 
the vertebrate genome, which also produces piRNAs.%* 

All known snoRNAs (in fungi, plants and animals, 
from yeast to human) are processed to produce smaller 
species: those derived from C/D snoRNAs show a 
bimodal size distribution of 17-19nt and >27nt, the lat- 
ter similar in size to piRNAs.**° Those derived from H/ 
ACA snoRNAs are predominantly 20—24nt in length? 
(Figure 12.5), at least some of which have been shown to 
be Dicer-dependent, bind to Argonaute, function as miR- 
NAs and have developmental effects.265435-439 

Likewise, spliceosomal snRNAs are processed into 
smaller fragments,*% although their significance is 
unknown. Vault RNAs (Chapter 8) are processed as well, 
dependent on cytosine methylation, into small RNAs 
that bind to Argonaute proteins and exhibit miRNA- 
like functions.^? Other short (80–100 nt) intronically 
derived RNAs, some of which are highly conserved, also 
associate with Argonaute proteins and are capable of 
repressing mRNAs via target sequences in the 3'UTRs.^*! 
All of these observations indicate deep evolutionary and 
complex functional connections between regulatory 
RNA pathways. 

Finally, another class of small RNAs with a modal 
length of 17-18nt is associated with transcription start 
sites (‘tiRNAs’) and exonic splice sites (‘spliRNAs’) in 
animals, but not in plants, with evidence that they are 
associated with binding sites for epigenetic regulators 
and mark nucleosome positions,**-44 which may be 
subject to fine control in animals because of the required 
precision of ontogeny (Chapter 15). 


RNA COMMUNICATION BETWEEN SPECIES 


Some of the early experimental RNAi methods in C. ele- 
gans indicated the potential of interspecies communica- 
tion by small RNAs: “dsRNAs expressed in Escherichia 
coli could induce gene silencing in nematode larvae that 
feed on them ... raising the possibility that dsRNAs 
may be transferred from prokaryotic pathogens to their 
hosts”.445 

Small RNAs have subsequently been shown to medi- 
ate communication between species, for pathogenicity, 
defense or symbiosis. For example, the small RNA SsrA 
produced by the bacterium Vibrio fischeri is loaded into 
outer membrane vesicles that are transported into the 
epithelial cells in the light organ of the squid, where it 
suppresses antimicrobial responses and permits light-organ 
symbiosis between the bacterium and its host.*% 


FIGURE 12.5 Processing of an miRNA from snoRNA ACA45. The blue bar shows the sequences of the putative miRNA 
products and the number of reads in Argonaute-associated small RNA sequencing data, while the red bar shows the putative ‘star’ product 
from the stem-loop precursor. (Reproduced from Ender et al. with permission of Elsevier.) 

The soil bacterium Pseudomonas aeruginosa, which 
is pathogenic in many species, releases a small RNA 
termed P11 into C. elegans that is processed by the RNAi 
and piRNA machinery, and specifically silences the 
expression of the protein maco-1, which functions in che- 
motaxis and neuronal excitability, leading to avoidance 
behavior that is heritable for four generations, of value 
to both.“ C. elegans transmits this memory of learned 
avoidance to naive animals and four generations of progeny 
through virus-like particles encoded by the Cer1 
retrotransposon.*48 

Similarly, the bacterial pathogen Listeria monocyto- 
genes secretes a small RNA-binding protein named Zea 
that binds a subset of L. monocytogenes RNAs and RIG- 
I, the non-self-RNA innate immunity sensor in mamma- 
lian cells. In mice, this was shown to dampen the immune 
response and suggested to allow sustainable host-bacte- 
rium symbiosis.^? Zea orthologs occur in other bacteria 
that are rarely associated with disease, suggesting that this 
may be a general mechanism to assist co-existence between 
bacteria and multicellular hosts,^? especially in view of the 
growing awareness of the importance of the enteric micro- 
biome in ruminant biology and mammalian physiology. 

Such exchanges may also occur between eukaryotic kingdoms, 
as shown between the mycorrhizal fungus Pisolithus 
microcarpus and its host Eucalyptus grandis, in which 
the fungal miRNA Pmic_miR-6 is transported into the 
plant roots, targeting host genes to promote symbiosis.“ 
Such interactions may influence the pathogenicity and 
aggressiveness of parasites, as exemplified by the 
interaction between the fungal pathogen Botrytis cinerea and 
its host plants, in which retrotransposon-derived small 
RNAs from the fungus are proposed to be delivered into 
plant cells, where they undermine host defense by 
cross-kingdom RNA interference (ckRNAi).^! 

Even complex ecological interactions may be 
affected by regulatory RNAs. This is suggested by the 
reported effect of short satellite RNAs (satRNAs, a 
type of subviral agent associated with some viruses, 
Chapter 8) that accompany cucumber mosaic virus 
(CMV) and are processed into small RNA (sRNA) species 
in infected tobacco leaves. The production of these 
Y-sat sRNAs affects host gene expression and changes 
the color of the leaves from green to bright 
yellow, which preferentially attracts aphid vectors. 
Surprisingly, the Y-sat RNAs are also processed in the 
aphids feeding on the leaves, producing small RNAs 
that affect aphid physiology, turning 
them red and subsequently promoting wing formation. 
This in turn facilitates the spread of CMV and Y-sat RNA 
without apparent detrimental effects on the plant, 
suggesting an intricate ‘survival strategy’ by 
a subviral non-coding RNA that depends on (and affects) 
the interactions between the CMV helper virus, the 
host plant and the insect vector.^? 

It is early days in the exploration and understanding 
of the extent of communication by "social RNA" 
(a term coined by Eric Miska)^?^* in the competition 
and cooperation within and between species in com- 
plex environments, but there are further documented 
examples of RNA exchange, including between para- 
sitic nematodes and mammalian cells,+° plants and 
fungal pathogens, 99457 parasitic and host plants, ^5 
and even between queen and worker bees via royal and 
worker jellies,*°°4 a pathway which has been exploited 
to artificially introduce virus resistance to commercial 
hives.^6! 


CRISPR 


The exponential growth and analysis of DNA and 
RNA sequence data transformed molecular biology, 
with no better demonstration than the serendipitous 
discovery* of the CRISPR RNA-based immune system in 
prokaryotes.466-470 

In 1987, Yoshizumi Ishino and colleagues reported 
the existence of mysterious, partly palindromic ~30 nt 
tandemly repeated sequences, separated by ~30 nt 
‘spacers’ in E. coli“! A similar peculiar array of 
interspersed repeats was described in the salt-toler- 
ant archaeon Haloferax mediterranei by Francisco 
Mojica and colleagues in 1993, who went on to col- 
late instances of such sequences in many different 
species, which they called “Short Regularly Spaced 
Repeats'.7247%3 Others doing related bioinformatic 
analyses showed that a common set of protein-cod- 
ing “cas” genes lay adjacent to these repeats, which 
they termed ‘Clustered Regularly Interspaced Short 
Palindromic Repeats’, whose acronym, CRISPR, 
being more euphonic, became adopted. 


* Another serendipitous discovery was bacterial ‘retrons’, which were 
first discovered in 1989 in myxobacteria and E. coli.^92^9 Retrons produce 
chimeric, covalently linked RNA/DNA molecules by reverse 
transcription; before their discovery, reverse transcriptases had not been 
known to exist in bacteria.*% Later studies showed that retrons form a 
second line of defense against phages (abortive infection by cell death) 
that is triggered if the first lines of defense have collapsed.*%* 

s ‘CRISPR-associated’. 

The function of CRISPR was revealed in 2005, when 
three groups, including Mojica’s,' discovered that the 
spacers separating the repeats corresponded to fragments 
of bacteriophages and plasmids, indicating that 
these sequences recorded past invasions and constituted 
a memory-defense system,^77-^? reinforced by the 
bioinformatically inferred nuclease function of the CRISPR- 
associated cas genes and analogies with RNAi.479480 

This conclusion was confirmed when Philippe 
Horvath, Rodolphe Barrangou, Sylvain Moineau and col- 
leagues working at the French company Danisco" showed 
in 2007 that recently selected phage-resistant mutants of 
the lactic acid-producing bacterium Streptococcus ther- 
mophilus, used in yogurt and cheese production, had 
acquired phage-derived sequences at their CRISPR loci 
and that phage isolates that overcame the immunity had 
single nucleotide changes that altered the sequences cor- 
responding to the spacers.%! These and other investiga- 
tors, notably Luciano Marraffini, John van der Oost and 
colleagues, also showed that the CRISPR array is transcribed 
and cleaved within the repeats to produce 61 nt 
‘crRNAs’ wherein the spacer (target) sequence is flanked 
by the palindromic sequences of the repeats. The crRNAs 
are then loaded into a cas-encoded nuclease,*81-48% which 
uses the spacer guide sequence to target and introduce 
a double strand break in cognate DNA via two distinct 
nuclease domains." The Cas proteins are essentially RNA- 
programmable restriction enzymes^57455 (Chapter 6), 
some of which are also used for transposon homing.*894° 
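
The memory-and-escape logic described in the preceding paragraphs (spacers record past invaders, and a single nucleotide change in the matching phage sequence can defeat the immunity) can be illustrated with a toy model. This is only a minimal sketch under strong simplifying assumptions: it uses exact string matching, ignores strand orientation, PAM requirements and the Cas proteins themselves, and all sequences are invented. The function names `extract_spacers` and `is_targeted` are hypothetical and do not come from any published tool.

```python
# Toy model of CRISPR spacer memory (illustrative assumptions only:
# invented sequences, exact matching, no PAM, no strand handling).

def extract_spacers(array: str, repeat: str) -> list[str]:
    """Return the spacers that lie between consecutive repeats."""
    parts = array.split(repeat)
    # parts[0] is the leader before the first repeat and parts[-1] the
    # trailer after the last repeat; only interior pieces are spacers.
    return [p for p in parts[1:-1] if p]


def is_targeted(genome: str, spacers: list[str]) -> bool:
    """True if any spacer is found (exactly) in the invading genome."""
    return any(spacer in genome for spacer in spacers)


# Hypothetical sequences, chosen only for illustration.
REPEAT = "GTTTTAGAGCTA"
SPACER_A = "ACGTTGCAAGGCTTAACCGT"
SPACER_B = "TTGACCGGATCAGGCTAGCA"
ARRAY = "ATATAT" + REPEAT + SPACER_A + REPEAT + SPACER_B + REPEAT + "CCGG"

spacers = extract_spacers(ARRAY, REPEAT)

# A phage carrying a sequence that matches SPACER_A ...
PHAGE = "GGGGCC" + SPACER_A + "TTAACC"
# ... and an escape mutant with one nucleotide changed inside that match.
ESCAPE_MUTANT = PHAGE[:10] + "A" + PHAGE[11:]

print(spacers)
print(is_targeted(PHAGE, spacers))          # the recorded phage is recognized
print(is_targeted(ESCAPE_MUTANT, spacers))  # the point mutant is not
```

Real systems tolerate some mismatches and depend on additional sequence context, so the exact-match rule here is a deliberate oversimplification.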

A vital missing link was discovered when Emmanuelle 
Charpentier and colleagues, who had earlier shown that 
streptococcal virulence factors are controlled by regula- 
tory RNAs,?! found that the third most highly expressed 
sequence (after rRNA and tRNA) in Streptococcus pyogenes 
is a small RNA that is transcribed from a sequence 
adjacent to the CRISPR locus and has 25 bases of near- 
perfect complementarity to the repeats. This RNA, termed 
‘tracrRNA’ (trans-activating CRISPR RNA), hybridizes 
to the transcribed CRISPR array to guide cleavage by 
host-encoded RNase III,4”2 is loaded into the cas nuclease 
along with a crRNA, and is essential for its function^*? 
(Figure 12.6). An endogenous guide RNA autoregulates 
CRISPR-Cas expression to mitigate autoimmune toxicity. 


t Mojica's paper was first submitted for publication in 2003, but not 
published until early 2005, due to multiple rejections. In fact all three 
papers were rebuffed by leading journals, being considered by their 
editors to lack sufficient novelty and importance to send out for peer 
review.*% A similar story of editorial rejection and reviewing delays 
by leading journals robbed Giedrius Gasiunas and colleagues of 
recognition for the co-discovery of RNA programmable Cas9-crRNA 
target cleavage.17976 
u As in many other cases, like antisense RNAs and RNAi (to generate 
virus-resistant transgenic plants), some practical applications 
of CRISPR preceded mechanistic understanding. The company 
(Danisco), which was acquired by DuPont, announced in 2012 the 
commercial availability of several S. thermophilus strains immune 
to various bacteriophages for pizza cheese-making. 
v Cas9 contains an HNH-like nuclease domain that cleaves the DNA 
strand complementary to the guide RNA sequence, and an RuvC- 
like nuclease domain that cleaves the opposite DNA strand.^554*6 

There are a number of variants of CRISPR systems 
in bacterial and archaeal genomes.^994?7 Type I and type 
III systems employ a large multi-Cas protein complex for 
crRNA binding and target cleavage.^99^97 The simplest, 
Type II, consists of the CRISPR array, the gene encoding 
the tracrRNA, and four protein-coding cas genes, three of 
which are involved in generating new spacers and repeats 
while the fourth, cas9, encodes the targeting nuclease.**° 

Virginijus Siksnys, Horvath, Barrangou and colleagues 
showed that the CRISPR system from S. thermophilus 
could be functionally transferred into E. coli and that 
all that was required to target and cleave a specific DNA 
sequence is the tracrRNA, the Cas9 nuclease and the cor- 
responding crRNA,8 which could be as short as 20 nt 
as the flanking repeat sequences are not required, pav- 
ing "the way for engineering of universal programmable 
RNA-guided DNA endonucleases "^? 

Shortly thereafter, Charpentier, Martin Jinek, Jennifer 
Doudna and colleagues showed that the normally base- 
paired crRNA and tracrRNA could be covalently linked 
and produced as a single chimeric guide RNA (sgRNA), 
simplifying delivery of the system.*85486 

CRISPR systems have also been coopted naturally 
for various structural and regulatory functions by repur- 
posing of diverged repeats encoded outside of CRISPR 
arrays, including by transposons for RNA-guided trans- 
position, by plasmids for interplasmid competition, and 
by viruses for antidefense and interviral conflicts. There 
are also multiple highly derived CRISPR variants of yet 
unknown functions.^?? 


RNA-DIRECTED GENOME EDITING 


In 2013, the groups of Feng Zhang and George 
Church® demonstrated that the CRISPR system (Cas9 
plus sgRNAs) could be used in human and mouse cells 
to introduce sequence-specific double strand breaks," 
which, when repaired in vivo, lead to incorporation of 
guided mutations, the latest in a long line of ‘homing 
endonucleases’ with wide practical applications.50%505 
Zhang and colleagues showed that they could engineer 
the Cas9 sequence to disable one of its two nuclease 
domains, converting it into a ‘nickase’ enzyme that only 
introduces a single-strand break, facilitating the insertion 
of new sequences into genomes by homology-directed 
repair. They also showed that multiple guide sequences 
can be encoded into a single CRISPR array to enable the 
simultaneous alteration of several sites.5% 


w Group II introns were the first RNAs to be used for gene/genome 
targeting.502-504 

FIGURE 12.6 The type II CRISPR-Cas9 System from Streptococcus thermophilus. (Reproduced from Lander with 
permission of Elsevier.) Like RNAi, CRISPR-Cas systems rely on RNA guidance for target specificity. (a) The locus contains a CRISPR 
array, four protein-coding genes (cas9, cas1, cas2 and csn2) and the tracrRNA. The CRISPR array contains repeat regions (black 
diamonds) separated by spacer regions (colored rectangles) derived from phage and other invading genetic elements. The cas9 
gene encodes a nuclease that confers immunity by cutting invading DNA that matches existing spacers, while the cas1, cas2 and 
csn2 genes encode proteins that function in the acquisition of new spacers from invading DNA. (b) The CRISPR array and the 
tracrRNA are transcribed, giving rise to a long pre-crRNA and a tracrRNA. (c) These two RNAs hybridize via complementary 
sequences and are processed to shorter forms by Cas9 and RNase III. (d) The resulting complex (Cas9 + tracrRNA + crRNA) then 
searches for DNA sequences that match the spacer sequence (shown in red). Binding to the target site also requires the presence 
of the protospacer adjacent motif (PAM), which functions as a molecular handle for Cas9. (e) Upon binding to a target site 
matching the crRNA sequence, Cas9 cleaves the DNA three bases upstream of the PAM site. Cas9 contains two endonuclease domains, 
HNH and RuvC, which cleave, respectively, the complementary and non-complementary strands of the target DNA, creating blunt 
ends. Other natural or artificial Cas systems employ different nucleases.^?*495 
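
The targeting rule described for Cas9 (a site matching the guide's spacer sequence, immediately followed by a PAM, with a blunt cut three bases upstream of the PAM) can be expressed as a short search routine. The following is a minimal sketch, not a guide-design tool: it assumes exact matching on one DNA strand only, and it assumes the 5'-NGG-3' PAM of the widely used Streptococcus pyogenes Cas9, exposed as a parameter because other Cas9 orthologs recognize different PAMs. The function name `find_cas9_sites` and the example sequences are hypothetical.

```python
import re

# Illustrative sketch: exact matching, top strand only, uppercase DNA.
# The default PAM pattern (NGG) is an assumption about the Cas9 ortholog.

def find_cas9_sites(dna: str, spacer: str, pam_regex: str = "[ACGT]GG"):
    """Return (protospacer_start, cut_index) for each spacer match that is
    immediately followed by a PAM.

    cut_index is the 0-based position of the blunt cut, three bases
    upstream of the PAM: the fragments are dna[:cut_index] and
    dna[cut_index:].
    """
    sites = []
    start = dna.find(spacer)
    while start != -1:
        pam_start = start + len(spacer)
        pam = dna[pam_start:pam_start + 3]
        if re.fullmatch(pam_regex, pam):
            sites.append((start, pam_start - 3))
        start = dna.find(spacer, start + 1)
    return sites


# Hypothetical 20 nt spacer and two toy targets.
SPACER = "GACGTTACCGGATCAAGCTT"
WITH_PAM = "TTT" + SPACER + "AGG" + "CCC"     # spacer match followed by AGG
WITHOUT_PAM = "TTT" + SPACER + "ATT" + "CCC"  # same match, no PAM

for start, cut in find_cas9_sites(WITH_PAM, SPACER):
    print(start, cut, WITH_PAM[:cut], WITH_PAM[cut:])
print(find_cas9_sites(WITHOUT_PAM, SPACER))  # no PAM, so no site is reported
```

A practical tool would also scan the reverse complement and score partial mismatches; those steps are omitted here to keep the rule itself visible.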

CRISPR technology has proven far more versatile and 
efficient than the previous methods developed for targeted 
genome modification, which employed nucleases linked to 
the DNA-binding domains of transcription factors whose 
sequence specificity could be engineered by altering amino 
acids in these domains (called ZFNs* and TALENs), an 
effective but relatively cumbersome methodology.506-508 


* Involving ‘zinc fingers’, see Chapter 16. 

Moreover, CRISPR technology has been adapted by 
engineering the Cas9 and other Cas nucleases to inacti- 
vate, manipulate, image and annotate specific DNA and 
RNA sequences in a wide variety of organisms, 495509514 
including high-throughput mutagenesis to identify the 
roles of new genes and genetic pathways in cell and 
developmental biology.?^ CRISPR has also been widely 
used in industrial biotechnology?!6?"7 and in agriculture 
to construct transgenic plants and animals with desirable 
properties.518519 

The toolkit and applications have been advancing 
at breathtaking pace.” For example, nuclease-inac- 
tive and RNA-targeting Cas proteins have been fused 
to a plethora of different effector proteins to alter gene 
expression, epigenetic modifications and chromatin 
interactions.495521-56 These include the fusion of 
transcriptional repressors or activators to catalytically 
inactive Cas9 to repress (‘CRISPRi’) or enhance (‘CRISPRa’) 
expression of specific protein-coding and non-coding 
genes (Figure 12.7).5?79! 

They also include the fusion of cytidine deaminase 
or reverse transcriptase to Cas9 to enable single base 
replacement or the insertion of any desired sequence 
(programmed into the guide RNA) at any specific position 
in the genome, termed ‘base editing’ and ‘prime 
editing’, respectively. These have been demonstrated 
to reverse damaging mutations at high efficiency in 
cell culture*?4532-535 and (to a limited extent, but with 
increasing success) in vivo including amelioration of 
deafness and amyloidosis,*%6338 which “in principle 
could correct up to 89% of known genetic variants 
associated with human diseases”. CRISPR systems 
have also been adapted to edit RNA with high specific- 
ity??? to correct a range of genetic defects,5%-5% and to 
develop multiplexed assays for high sensitivity detection 
of viral RNAs.54446 

CRISPR is also being deployed to propagate the 
ingression of desirable genes into natural populations, 
first mooted by CF Curtis in 1968 (using 
translocations)**7>48 and again by Austin Burt in 2003 (using 
"site-specific selfish genes" such as homing endonucleases, 
group II introns and some types of transposable 
elements).5% To date, effective introduction of ‘gene-drive’ 
systems based on CRISPR-Cas9 has been achieved in 
yeast,?^9559 fruit fly?! and mosquito,%%235 among others, 
with applications including eliminating vector-borne 
and parasitic diseases, promoting sustainable agriculture 
and curtailing invasive species, albeit with biosafety and 
‘ethical’ concerns,55%5% which in turn have motivated the 
design of tunable and reversible systems.557-560 

Advanced nanoparticle delivery systems are being 
developed in parallel to enable high-efficiency tissue- 
specific delivery in vivo for gene therapy, vaccination, 
cancer treatment and investigative genetic 
manipulation.561-56 The efficiency and specificity of genome 
editing continue to improve, with the constant factor being the 
flexibility of RNA guides to unlock the potential of 
targeting generic effector proteins to specific DNA or RNA 
locations. 

A billion years ago, RNA guides also solved the 
challenges of regulating cell fate decisions and tissue 
architecture in multicellular organisms (Chapters 15 and 16). 


13 Large RNAs with Many Functions 


Following fast on the heels of the genome projects came 
high-throughput RNA sequencing projects, which, not- 
withstanding the excitement around the RNA interfer- 
ence pathway and small RNA control of gene expression, 
marked the turning point from regarding non-coding 
RNAs as ancillary contributors to mainstream players in 
cell and developmental biology.'? 

The technique of cloning end fragments of RNAs 
as ‘Expressed Sequence Tags’ (ESTs) was developed 
and popularized in the 1990s to identify protein-coding 
genes? and their spliced isoforms.” Because mRNAs 
typically comprise only ~3% of the total RNA in cells 
(rRNAs and tRNAs together comprise ~95%) and it was assumed 
that polyadenylated RNAs are mRNAs, oligo(dT) hybridization 
was used to purify these RNAs and to prime reverse 
transcription from their 3' ends. 

Unexpectedly, the large-scale cDNA cloning efforts 
yielded many sequences that lacked protein-coding 
capacity, ^^ which were initially suspected to be deg- 
radation products or DNA contamination during library 
preparation. This led to alternative methods to “filter” for 
protein-coding genes, such as by evolutionary conser- 
vation.? The existence of long 3’UTRs in mRNAs also 
presented difficulties, as the cloned ESTs often did not 
extend into the upstream protein-coding sequences.* 

To circumvent the latter problem strategies were devel- 
oped to increase the representation of internal exons of 
transcripts by priming reverse transcription with random 
oligonucleotides.!8 This approach was used in the Human 
Cancer Genome Project to generate nearly one million 
‘Open Reading Frame’ ESTs (‘ORESTES’) from cancers 
and normal tissues.'%-2! Many novel RNAs were 
identified, leading to the suggestion in 2000 that 36,000 human 
genes was likely a “significant underestimate”, even after 
sequences that did not correspond to predicted (protein- 
coding) genes were excluded.? It was later shown that 
approximately half of the ORESTES were differentially 
expressed intronic or intergenic RNAs." 


a High-throughput mRNA sequencing provided the necessary 
information not only for better annotation of gene (exon-intron) structures 
and alternative splicing,** but also for ‘proteomics’, which matches 
amino acid sequences predicted by mass spectrometry of peptides 
generated (usually) by proteolytic digestion with those in mRNA 
open reading frames. Although having a high false-positive rate, 
and complicated by post-translational modifications, proteomics has 
proved useful for identifying the protein constituents of subcellular 
organelles and complexes.5* 

b This assumption ignored historical evidence that some mRNAs such 
as histone mRNAs are not polyadenylated? and that non-polyadenylated 
(polyA−) RNAs are abundant in human cells.!0-12 

c The ‘Mammalian Gene Collection’ initiative (also) sequenced 
cDNAs from their 5' ends but discarded the clone if an AUG start 
codon and a following open reading frame were not identified,!%!7 
thereby excluding noncoding transcripts. 

The confounding problem was the wide dynamic 
range of gene expression, both within and between cells 
in heterogeneous tissues. Of the ~500,000 polyadenyl- 
ated RNA molecules estimated to exist in a human cell, 
at least half are accounted for by a small number of highly 
expressed mRNAs.” Over 95% of the RNA species are 
expressed at low levels: in a tissue such as rat liver, for 
example, ~10 species are present at ~10,000 copies, 500 
at ~200 copies, and 15,000 at ~10 copies or less per cell.*4 
The expression of the small fraction (3%) of ‘housekeep- 
ing’ genes is ~30-fold higher than all other transcripts,” 
so the latter can only be reliably observed at high sequenc- 
ing depth or with enrichment strategies.” 
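
The arithmetic behind this sampling problem can be made explicit using the rat liver figures quoted above. The sketch below is illustrative only: it treats the approximate values in the text as exact, takes 10 copies per cell as the upper bound for the low-abundance class, and assumes that sequenced clones sample RNA molecules uniformly at random. The 10,000-read library size and the function name `summarize` are hypothetical choices made for the example.

```python
# Abundance classes quoted in the text for rat liver:
# (class name, number of RNA species, approximate copies per cell).
ABUNDANCE_CLASSES = [
    ("high", 10, 10_000),
    ("medium", 500, 200),
    ("low", 15_000, 10),  # "~10 copies or less": 10 used as an upper bound
]


def summarize(classes, reads):
    """Per class: share of species, share of molecules, and the expected
    number of reads per species under uniform random sampling."""
    total_species = sum(n for _, n, _ in classes)
    total_molecules = sum(n * copies for _, n, copies in classes)
    rows = []
    for name, n, copies in classes:
        rows.append({
            "class": name,
            "species_fraction": n / total_species,
            "molecule_fraction": n * copies / total_molecules,
            "expected_reads_per_species": reads * copies / total_molecules,
        })
    return total_molecules, rows


# Hypothetical library of 10,000 sequenced clones.
total, rows = summarize(ABUNDANCE_CLASSES, reads=10_000)
print(f"molecules per cell in this model: {total}")
for row in rows:
    print(
        f"{row['class']:>6}: "
        f"{row['species_fraction']:.1%} of species, "
        f"{row['molecule_fraction']:.1%} of molecules, "
        f"{row['expected_reads_per_species']:.2f} expected reads per species"
    )
```

Under these assumptions the low-abundance class makes up about 97% of the species but less than half of the molecules, and a 10,000-clone library would yield on average fewer than one read per low-abundance species, which is why depth or enrichment is needed to observe them reliably.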

Consequently, studies aiming to comprehensively 
survey the transcriptome used hybridization subtraction 
methods to deplete abundant RNAs and improve the rep- 
resentation of cell- and tissue-specific transcripts.?9?7 To 
reduce interpretation problems, a method termed ‘Cap 
Analysis of Gene Expression’ (CAGE) was developed by 
Piero Carninci, Yoshihide Hayashizaki and colleagues 
using biotinylation of terminal nucleotides to capture 
‘full-length’ RNAP II transcripts? containing both the 5' 
end and the 3' polyA tail.!^2729-? 


PERVASIVE TRANSCRIPTION 


In the early 2000s, Hayashizaki, Carninci and colleagues 
established the influential FANTOM projects,5%* which 
undertook large-scale sequencing of normalized full- 
length cDNA libraries constructed from a wide variety of 
mouse cell types, tissues and developmental stages, with 
the objective of characterizing the proteome. However, the 
unexpected and ultimately headline result was the 
identification of over 34,000 long “mRNA-like” (5' capped 
and 3' polyadenylated) transcripts often emanating from 
‘intergenic’ regions, many of which were spliced and 
differentially expressed, but did not appear to code for 
proteins.'+34-37 At least 70% of protein-coding loci were 
also found to express overlapping antisense RNAs, some 
of which were shown to have regulatory function? and 
to be conserved over large evolutionary distances in the 
vertebrates.^" Widespread tissue-specific sense-antisense 
and intronic transcription was also observed across the 
spectrum from yeast to humans.*!~? 


d A large fraction of the RNAs in human cells are not polyadenylated, 
although they are often capped.10-1228 

e FANTOM: Functional Annotation of the Mouse (later Mammalian) 
Genome. The successive FANTOM projects introduced many 
technical innovations and have produced a wealth of data, including well- 
annotated transcription start and termination atlases, full-length 
cDNA clones and other valuable resources for the international 
research community.5^^ 

Similar findings were reported using the orthogonal 
method of high-density genome tiling arraysí indepen- 
dently by Tom Gingeras, Mike Snyder and colleagues, 
who showed that transcribed sequences covered far more 
of the human genomes than predicted from protein- 
coding gene annotations,?*5^ with pervasive transcrip- 
tion of coding and non-coding regions in embryonic 
stem cells, becoming more restricted as differentia- 
tion proceeds.°* They also showed that almost half of 
the transcripts in human cells are not polyadenylated,? 
confirming largely forgotten reports from the 1970s and 
1980s.?-? The non-polyadenylated RNAs are derived 
from repeats and introns,??^? often transcribed at high 
levels by RNA polymerase III^,9?77! or from processing of 
RNAPII transcripts. Moreover, the total mass of the 


f Hybridized to cDNAs randomly primed from polyA+ or polyA- 
RNAs. 

g Again described as the “dark matter" in the genome.? 

^ RNA polymerase III (RNAPIII) produces various types of regula- 
tory RNAs from repeat sequences, some of which are clade-specific, 
such as B2 SINE RNAs in mice and Alu RNAs in humans, in both 
cases with modular structures that repress RNA polymerase II dur- 
ing stresses like heat shock at specific loci, a striking example of 
convergent evolution.9-9? Large numbers of human- or primate- 
specific Alu-derived short RNAs transcribed by RNAPIII have been 
identified by bioinformatic and biochemical strategies, including a 
class of structured small (~120 nt) RNAs (snaRs) complexed with the 
dsRNA-binding nuclear factor 90 family. These are mostly genomi- 
cally clustered and differentially expressed in regions of the brain, 
other tissues and cancer cells.?-9 Recently it has been shown that 
transcription of snaR-A, which produces an miRNA that targets a 
metastasis inhibitor,” is driven by an embryonic isoform of RNAPIII 
that is also upregulated in cancer cells, whereas the other isoform 
is expressed in specialized tissues. Other TE-derived RNAs are 
involved in other aspects of genome expression and organization, 
discussed in Chapter 16. For example, bidirectional transcription by 
RNAPII and RNAPIII of the B2 SINE sequence in mice was found 
to restructure the growth hormone locus into nuclear compartments 
and define the heterochromatin-euchromatin boundary to regulate 
the expression of the gene during organogenesis, suggesting a role 
of these abundant elements in the topological organization of the 
genome.9* 
non-ribosomal, non-protein-coding RNAs in human cells 
and brains was found to exceed that of the mRNAs.7677 

While controversial at first, these findings were con- 
firmed by other studies in animals, plants and fungi using 
a variety of techniques, including additional large-scale 
cloning and sequencing of cDNAs,?^7/7$-3! serial analy- 
sis of gene expression and massively parallel signature 
sequencing?» and microarrays and genome-wide tiling 
arrays that probe the expression of non-coding 
regions.15374652555678,6-89 Over 85% of the Drosophila 
genome was found to be dynamically expressed during 
the first 24h of embryonic development,*%-% over 70% 
of the C. elegans genome to be transcribed in mixed- 
stage populations** and over 85% of the yeast genome 
to be expressed in rich media.” A subsequent intensive 
survey of 1% of the human genome by the ENCODE^i 
Consortium showed that at least 93% of the nucleotides in 
the studied regions are transcribed in one or more of the 
11 cell lines analyzed and that most of the unannotated 
transcripts are expressed in just one or a few.% 

At the same time, we and others showed that the expression 
of unannotated mammalian long non-coding RNAs 
(lncRNAs)^j is highly dynamic and tissue-specific (details 
below), being most extensive in brain and testis.?7?* 
LncRNAs are also differentially expressed during development 
in other organisms.” Some of the non-coding 
transcripts were found to be huge, tens and sometimes 
hundreds of kilobases in length (“macroRNAs”),0%0-1%2 a 
well-characterized example being Air (108 kb) from the 
imprinted Igf2r locus (Chapter 9).!%0,103,104 A more recent 
pan-transcriptome analysis reported the discovery of thousands 
of novel RNAs, including a previously poorly cataloged 
class of non-polyadenylated single-exon lncRNAs.!05 

Widespread antisense transcription is also observed in 
bacteria!% and viruses.!7108 Indeed, global transcriptome 
analyses showed that the vast majority of all nucleotides in 
all genomes from viruses to humans are transcribed from 
one or both strands at some point in their life cycle.!% 


THE AMAZING COMPLEXITY 
OF THE TRANSCRIPTOME 


Sequencing of expressed RNAs and RACE^k-tiling arrays 
also revealed that most transcripts in mammals and insects 


i ENCODE: Encyclopedia of DNA Elements. 

j LncRNAs are defined as non-protein-coding RNAs ≥200 nt, an 
arbitrary classification partly based on a size cutoff in biochemical/ 
biophysical commercial RNA purification kits and protocols that 
exclude most infrastructural RNAs, such as tRNAs, snoRNAs and 
snRNAs, as well as miRNAs, siRNAs and piRNAs.?? 

k RACE: Rapid amplification of cDNA ends. 
have alternative transcription start and termination sites, 
the former often initiating hundreds of kilobases upstream 
of the previously annotated gene starting point(s) and 
spanning other genes in between.?7?.119.!!! Approximately 
250,000 transcriptional start sites in mammals reside 
within transposon or retroviral^l derived sequences, which 
account for up to 30% of the transcribed loci, produce 5' 
capped RNAs that are generally tissue-specific, and frequently 
function as alternative promoters of protein-coding 
genes and/or express non-coding RNAs.!? 

Extraordinary complexity of tissue- and lineage- 
specific alternative splicing was also observed,^!^-!!5 
particularly in lncRNAs.*!? LncRNAs appear to be 
less efficiently spliced than mRNAs, a feature that is 
correlated with heightened alternative over constitutive 
splicing,'20-122 possibly related to chromatin retention.?? 
LncRNAs are enriched in transposon-derived elements 
and other repeat sequences,??.2^-1?7 which is likely related 
to their functional modularity (Chapter 16). Thousands of 
‘pseudogenes’ are also transcribed,!%-1! as are develop- 
mental ‘enhancers’, whose numbers far outweigh those of 
protein-coding genes?-!? (Chapter 14). 

Transcriptome analyses progressively revealed the exis- 
tence (and, in most cases, drastically expanded the repertoire) 
of other types of transcripts!* various ‘classes’ of promoter- 
associated RNAs,'^^-^* 3'UTRs (Chapter 9) and regulatory 
RNAs originating from intergenic spacers between rRNA 
genes (in which promoters and transcripts had been known 
for decades — see references in!*-151), as well as other classes 
such as circular lncRNAs (circRNAs)?-^ (see below) 
and intron-derived lncRNAs with “snoRNA-ends” 
(sno-lncRNAs).'? Intron retention was also found to be common 


l Later studies showed that the number of long noncoding RNAs 
expressed from endogenous retroviral promoters correlates with 
pluripotency or the degree of malignant transformation.!!? 
FIGURE 13.1 Graphical representation of the complexity of the transcriptional landscape in mammals. (Reproduced from 


in plants and animals, where it is used to control cell differ- 
entiation.+*9%156-163 Surprisingly, even RNAs modified with 
N-glycans and a range of other RNAs are displayed on cell 
surfaces, apparently cell type specifically.!6^165 

The picture that emerged is that eukaryotic genomes 
express a semi-continuum of interlacing and overlap- 
ping coding and non-coding transcripts from both DNA 
strands,^m especially in animals'^45771-5 (Figure 13.1). 
There are genes encoding proteins, snoRNAs, miRNAs 
and other small non-coding RNAs located within other 
genes encoding proteins and lncRNAs, with unclear 
boundaries, often in nested chains where three or more 
transcripts overlap in “complex loci”, and the landscape of 
expressed transcripts, as well as their promoters, splicing 
patterns and termination points, are different in different 
cells and tissues.* Indeed, the true extent of the reper- 
toire of RNA expression is still unknown, given that most 
analyses to date have been carried out in cultured cells 
and do not capture the fine scale transcriptomes of diverse 
cells during the ontogeny of differentiation and develop- 
ment, although this is rapidly changing with the ubiquity 
of RNA sequencing, including increasingly powerful 
single-cell sequencing analyses of a variety of organisms, 
tissues and conditions (see below and following chapters). 

Collectively, these observations were revolutionary in 
their implications. They challenged both the equivalence 


m A similar albeit less complex genomic organization pertains in 
prokaryotes, where hundreds of transcriptional start sites are located 
within operons, as well as opposite to annotated genes, indicating 
that the complexity of gene expression is increased by uncoupling 
polycistronic linkages and the genome-wide use of antisense tran- 
scription.!% Apart from the thousands of short regulatory RNAs 
(Chapter 9) there are other, often still mysterious, longer (>200nt) 
highly structured and conserved noncoding RNAs that have been 
discovered in bacteria.!67-170 
of genes with proteins^n and the notion of 'genes' as 
discrete entities.72174178 They also suggested the opposite 
of what had been long thought, i.e., that the genomes of 
humans and other complex organisms are information 
dense, not information sparse!’ (Chapter 7). The genome 
could no longer be envisaged as a linear array of protein- 
coding genes and associated cis-regulatory sequences, 
with some infrastructural and idiosyncratic non-coding 
RNAs. Rather, genome biology had to be reimagined as 
a highly dynamic continuum of coding and non-coding 
transcription,!”+178,180 the latter becoming more extensive 
as developmental complexity increases.? Moreover, 
almost every gene is overlapped by an exon or intron of 
another gene expressed in some cell type, and any given 
sequence can be intronic, exonic or intergenic, depending 
on the expression state of the cells. 


PROTEIN-CODING OR NOISE? 


Unsurprisingly, these findings were initially met with 
skepticism. They indicated a massive hidden layer of 
RNAs of unknown function that had not been coun- 
tenanced by the existing models of gene regulation, 
although the precedents had been there for decades, espe- 
cially in the homeotic loci controlling organism develop- 
ment (Chapters 5 and 9). The protein-centric conception 
of genetic information and gene regulation could accom- 
modate a few idiosyncratic regulatory RNAs, and post- 
transcriptional control of gene expression by miRNAs, 
but not tens of thousands of non-coding RNAs. No won- 
der there were reservations. 

One difficulty was discriminating coding from non- 
coding transcripts,'*! leading to the suggestion that some 
or many of the newly identified transcripts might con- 
tain short open reading frames that had fallen below the 
radar of the genome annotations, which had generally 
used a minimum open reading frame of 100 codons.^o 
Subsequent studies using ORF conservation, proteomic 
analyses and ribosomal profiling showed that, while 
there may be hundreds of unrecognized short proteins, 


n This created problems for genome annotations, which had been 
traditionally organized around protein-coding genes, although they are 
still used as landmarks." 

o The initial annotation of the human genome generally used the 
presence of a conserved open reading frame of 100 codons in RNAs 
(exonic fragmentation of protein-coding sequences made this assess- 
ment impossible at genome sequence level) as an arbitrary cutoff on 
the basis that this was unlikely to occur by chance. The sequencing 
of large numbers of vertebrate genomes will allow more accurate 
assessment of conserved open reading frames, ultimately with near 
statistical certainty at the codon level. 
including hormones and small peptides encoded in some 
RNAs annotated as non-coding,?-?7 the vast majority 
of lncRNAs exhibit no evidence of protein-coding 
capacity.!85.186,198 

The dichotomy may itself be false: at least some 
RNAs have dual function as both coding and regulatory 
RNAs!33.181,189,194,199-210 and mRNAs appear to play a 
role in cellular organization.?!!?? Some protein-coding 
loci also generate miRNAs or lncRNAs^p by alternative 
splicing,?09215-2?! and there is “a non-negligible fraction 
of protein-coding genes (where) the major transcript 
does not code a protein”.??? Some regulatory lncRNAs 
have evolved from protein-coding ancestors and 
pseudogenes.!78.!31.223-225 Some genes that express lncRNAs 
with enhancer activity also produce miRNAs?6227 and 
micropeptides,59.190208:28279 further examples of parallel 
outputs from complex loci. Moreover, lncRNAs 
have many alternative splice isoforms^ (Chapter 16), 
some of which have been shown to have different 
functions.230-235 

Skepticism of the significance of non-coding tran- 
scription took many forms, including speculation that the 
unannotated RNAs are technical artifacts, genomic DNA 
contamination or have dispensable functions.!9?236237 
The most common reaction to these findings, however, 
was to assert that the bulk of the observed transcription, 
although dynamic and often cell- or tissue-specific, is 
“noise”: most non-coding transcripts were detected at low 
levels, many are or appeared to be comprised of just one 
exon, and many seemingly ‘random’ fragments abounded 
in initial RNA sequencing datasets, with differential 
expression sometimes attributed to variations in chroma- 
tin accessibility.238-241 

The concept of transcriptional noise was introduced in 
the 1990s by the observation of the cellular heterogene- 
ity and stochastic fluctuations in the firing of known pro- 
moters in bacteria and yeast, not spurious transcription 
from illegitimate initiation sites.?4-?^6 Nonetheless it was 
seized upon, and conflated with ‘neutral evolution’,247 
leading to debates about the functionality or otherwise 
of the plethora of non-coding transcripts in eukaryotic 
cells,23%248251 reprising previous discussions about the 
functionality of introns, transposon-derived sequences 
and pseudogenes. 


p For example, the PNUTS gene encodes both PNUTS mRNA and 
lncRNA-PNUTS by alternative splicing of the primary transcript, 
each eliciting distinct biological functions; PNUTS mRNA is 
ubiquitously expressed, whereas the production of lncRNA-PNUTS is 
tightly regulated.?!? 
THE RESTRICTED EXPRESSION OF 
LONG NON-CODING RNAs 


It transpired that the low-level and fragmentary signals 
from lncRNAs in sequencing datasets are mainly a 
consequence of their highly developmental stage-specific 
expression, exacerbated by insufficient sequencing 
depth,^q especially in complex tissues.22245252255 

The expectation that high expression levels reflect 
functionality is based on the prevalence of protein-coding 
RNAs, which are, on average, more highly expressed than 
regulatory RNAs, although there are exceptions.?54^255 
Indeed, mRNAs encoding regulatory proteins such as 
transcription factors are usually expressed at lower levels 
than those encoding structural or metabolic proteins, and 
have shorter half-lives.256257 

Regulatory RNAs would likewise require relatively 
low average expression levels, i.e., more localized expres- 
sion, and more dynamic control.25*25 Examples, among 
many others, include functionally validated chromatin- 
associated lncRNAs detected on average in fewer than 
ten copies per cell in populations.25265 Single-molecule 
RNA FISH revealed that the localized TERT (telomerase 
reverse transcriptase) pre-mRNA occurs in 9-10 copies 
per cell and is only spliced during mitosis.!%266 Similar 
low expression levels are also observed for a number of 
functionally well validated regulatory RNAs,? includ- 
ing XIST (Chapter 16). 

Many smaller “cryptic unstable transcripts” expressed 
antisense from promoters and from intra- and intergenic 
regions are rapidly degraded by RNA turnover and 
surveillance^r pathways,*”286-288 which were thought to be 
a quality control mechanism to limit “inappropriate 


q This problem also confounds the attempts to construct 'gene networks' 
from transcriptomic data, especially when different cells in a 
population are expressing different genes and responding differently 
to stimuli, and most data is derived from the 3' end (UTRs) of tran- 
scripts, which may or may not be part of a corresponding protein- 
coding mRNA.” 

r Via RNA-degrading “exosomes”?97?9? (not to be confused with the 
extracellular vesicles that have the same name) and ‘nonsense-mediated 
RNA decay’ (NMD). NMD is known as a quality control mechanism 
to ensure only mRNAs with complete open reading frames 
are exported for translation, by degrading “aberrant RNAs” that have 
“exon junction complexes” 3' to stop codons (because the stop codon 
of protein-coding genes is located primarily in the last exon).?70-276 
However, recent evidence suggests that it also distinguishes short- 
lived regulatory RNAs from mRNAs and controls their steady-state 
levels in different contexts,?7728 including stress responses.??! The 
NMD pathway is developmentally regulated,?”*? and required for 
embryonic stem cell fate determination?9?** and neuronal architec- 
ture.2% Loss of NMD components leads to developmental abnormal- 
ities and neurological disorders.?/6255 
expression”,??6 but were subsequently found to regulate 
promoter and enhancer function (Chapter 16).29?297 

Similarly, other transcripts associated with promoters 
were detected only after depletion of components of 
RNA-degrading ‘exosomes’, including ‘promoter-associated 
RNAs’ (PASRs) of ~250-500 nt identified in yeast 
and human cells,!!^^ and ‘promoter upstream transcripts’ 
(PROMPTs) of ~0.5-2.5 kb, identified in human cells 
in both sense and antisense orientations upstream of the 
transcription start sites of expressed genes. Exosomes 
have also been shown to control the levels of “a vast 
number” of lncRNAs with enhancer activity in B cells and 
pluripotent embryonic stem cells.2% 

Analysis of in situ hybridization patterns of over 
1,000 lncRNAs showed that a high proportion exhibit 
precise expression patterns in the brain/central nervous 
system, easily detected in highly specific and localized 
populations of cells in the striatum, retina, hippocampus, 
cerebellum, olfactory bulb and cortical layers, among 
others.??90? Complex region- or cell-specific and devel- 
opmentally transient expression patterns of ‘intergenic’ 
and antisense lncRNAs have also been observed in fish,3% 
Drosophila,?9^3 honey bees30%-308 and other multicellu- 
lar and unicellular eukaryotes?? (Figure 13.2). It has also 
been observed in globin!?*909!! and interleukin loci,?!?2? 
now being extended to others by detailed examination of 
more tissues and developmental time points, aided by the 
advent of single-cell sequencing.?!43!8 

To get around the problem of low sequencing depth, 
John Rinn, Tim Mercer and colleagues developed a 
method called ‘RNA CaptureSeq’,^s akin to exome 
sequencing, to enrich transcripts expressed from specific 
genomic locations.22253,326 This approach showed that, 
even in a relatively homogeneous population of cultured 
fibroblasts, regions that appeared devoid of transcripts 
(sometimes referred to as ‘gene deserts’) expressed 
lncRNAs in a subset of the cells.?? It detected previously 
unknown isoforms of intensively studied protein-coding 
genes, such as TP53, and lncRNAs expressed from 
homeotic and other developmental loci.252253.326-328 It also 
revealed that most GWAS regions, including those 
associated with neuropsychiatric functions, are transcribed 
into lncRNAs.??-?! Other studies showed that most 
intergenic lncRNAs originate from enhancers and are 
specifically expressed in cell types relevant to the associated 
GWAS trait???-333 (see below and Chapters 14 and 16). 

RNA CaptureSeq also revealed that many of the existing 
annotations of lncRNA (and some mRNA) structures 


s Along with sophisticated internal standards to properly measure the 
sensitivity of DNA and RNA sequencing analysis.520-325 
FIGURE 13.2  Reflective in situ hybridization patterns of the expression of the protein-coding gene ultrabithorax (Ubx) and its 
overlapping antisense RNA in embryos of the centipede Strigamia maritima at mid- and late-segmentation stages. (Reproduced 
from Brena et al.?? with permission of John Wiley and Sons.) The arrow marks the anterior margin of the first leg-bearing segment (LBS). 


were incomplete,^t that many lncRNAs are multi-exonic, 
and that the internal exons of lncRNAs (but not mRNAs) 
are almost universally alternatively spliced.^ These 
high-resolution and recent single-cell RNA sequencing studies 
also indicate that the number of lncRNAs and lncRNA 
isoforms that are expressed is far greater than cataloged 
in current databases.05554 

Thus, lncRNAs generally show more tissue-restricted 
and transient expression patterns than 
mRNAs,16.155254500,31835337 helping to explain their low 
representation in RNA sequencing datasets (Figure 13.3). 
This in turn suggests that lncRNAs are more specific 
markers and regulators of cell state, including disease 
state, than proteins with generic functions in, e.g., mus- 
cle, bone or neuronal cells. 


OTHER INDICES OF FUNCTIONALITY 


LncRNAs are dynamically expressed during all aspects 
of animal differentiation and development,^u along developmental 
axes,8%1322603% in embryonic stem cells,58:1245341,342 
neuronal cells,2535-5 muscle cells,*4° mammary 


t These studies also showed that there are far more regulatory 5' exons 
in human than in mouse mRNAs, which are also highly alterna- 
tively spliced, suggesting that humans have evolved a more complex 
cis-regulatory architecture of mRNAs,* possibly related to brain 
function. 

u Including in sponge.?^? 


gland,** hematopoietic and immune cells,*”*48? among 
many others.??5*»? They are also differentially expressed 
in neurological responses, for example, in the songbird 
zebrafinch where “40% of transcripts in the unstimu- 
lated auditory forebrain are non-coding and derive from 
intronic or intergenic loci.. Among the RNAs that are 
rapidly suppressed in response to new vocal signals ... 
two-thirds are ncRNAs”.% LncRNAs show altered 
expression in cancer and other diseases?? (see below) 
and are also dynamically expressed during plant and 
fungal development.*!1-353355 LncRNAs are also traf- 
ficked to specific subcellular locations in the nucleus 
and cytoplasm, and specific domains within them 
(Chapter 16).28.300,303,307,336,356-372 

The half-lives of lncRNAs are broadly similar to those 
of mRNAs, over an equally wide range, many being 
highly stable.?57373:374 In fact, the loci expressing lncRNAs 
exhibit most of the characteristics of bona fide genes:??? 
their expression is regulated by conventional hormones, 
morphogens and transcription factors;*++376382 many have 
polyadenylation sites; they show non-neutral muta- 
tional patterns;*** their promoters and exons have chro- 
matin marks similar to those of protein-coding genes? 
(Chapter 14); and their splice junctions and structures are 
conserved,+1%584+-3% allowing the identification of ortho- 
logs in other species.?57-58? 

Surprisingly, the promoters of lncRNAs are, on 
average, more conserved than those of protein-coding 
genes,'+384390 suggesting higher cell specificity, consistent 
with their restricted expression and the conserved 
expression patterns of syntenic RNAs.??!?* Although the 
proportion of lncRNAs with primary sequence similarity 
is low among vertebrates,?*^3*? thousands of RNAs in 
mammals, Drosophila, plants and yeast have conserved 
secondary structures and sequence motifs, and a minimum 
of 20% of the mammalian genome has been shown 
to be under evolutionary selection at the level of predicted 
RNA structure.?56394-401 

Lack of conservation does not mean lack of 
function.*% There are well-described examples of lncRNAs 
(including Drosophila roX RNAs) that evolve rapidly 
while maintaining functional interactions.*02-40% Xist 
shows only patchy primary sequence conservation 
among mammals, notably in its ‘repeat’ sequences, and 
its adjacent lncRNA, Jpx, which activates Xist, while 
sharing no obvious sequence or structural homology 
between human and mouse, is functionally interchangeable 
between them.494405 
Many other well-studied lncRNAs involved in developmental 
processes, including Air," $ DISC2,407 NTT,*8 
BORG*” and UM9(5), show only short stretches of conserved 
sequences.^? In addition, many lncRNAs have 
conserved functions in vertebrate development despite 
rapid sequence divergence*!! and orthologous lncRNAs 
that are developmentally regulated in different species 
have been identified solely on the basis of the conservation 
of splice sites and associated introns.?*7 This indicates 
that lncRNAs have greater orthology than is evident 
from conventional sequence comparisons, reflecting positive 
selection for phenotypic variation+!? and more plastic 
structure-function relationships^^^' than protein-coding 
sequences.?.29? 

Furthermore, bearing in mind the difficulty of identifying 
orthologs that are evolving to alter the fine control 
of developmental processes, many lncRNAs are clade-specific 
and, consequently, largely unstudied. One example 
is the lncRNA Sphinx, which regulates courtship 
behavior in Drosophila and is expressed from a chimeric 
FIGURE 13.3 In situ hybridization of lncRNAs in mouse brain. (Original images from the Allen Brain Atlas,833 reproduced 
from Mercer et al.3%) 
gene that arose by the retrotransposition of a sequence 
from an ATP synthase gene and capture of an adjacent 
exon and intron, fixed by positive selection.^^ There are 
many other examples of species-specific lncRNAs, with 
evidence of recent birth and selective sweeps, control- 
ling cell differentiation (e.g., ^) and brain functions (see 
below and Chapter 17). 


GENETIC SIGNATURES 


Another reservation was that few lncRNAs had, at the 
time, been identified in genetic screens, which intrinsi- 
cally favored protein-coding mutations.!79375 

Protein-coding mutations are frequently disastrous, 
including those affecting enzymes, motor proteins, trans- 
porters, signaling proteins, etc., as well as transcription 
factors, epigenetic modifiers and other regulatory pro- 
teins, which cause system-wide malfunctions. The same 
holds for some highly expressed non-coding RNAs with 
generic functions,*" such as RMRP (the RNA component 
of RNase MRP, Chapter 8), mutations in which cause a 
pleiotropic human disease, cartilage-hair hypoplasia, first 
identified by linkage analysis^5^? and later shown also 
to produce miRNAs,^? to perturb helper T-cell epigenetic 
regulation?! and to inactivate the tumor suppressor 
p53,22 

A major blind spot was phenotypic bias: the severe and 
pleiotropic effects of damage to proteins or ‘housekeeping’ 
RNAs contrast with damage to regulatory sequences, which 
may only affect a part of the networks that control differ- 
entiation and development or environmental responses, 
with more subtle context- and/or cell type-specific conse- 
quences, often referred to as quantitative trait variation. 
Indeed, the use of the word ‘mutation’, as opposed to ‘varia- 
tion’, reflects an inherent bias in the identification of genetic 
factors that affect phenotypes in animals and plants, with 
those exhibiting strong negative effects (being easier to 
identify and map) understandably having taken precedence 
over those that do not (Chapters 7 and 11). 

The related blind spots were expectational, technical 
and interpretative bias: historically, most genetic screens 
used experimental and informatic approaches that priori- 
tized protein-coding genes and exons. Many now known 
important mutations in lncRNAs were consequently 
missed by exome sequencing and chromosomal micro- 
array analyses.?? Mutations that could not be tracked to 
and shown to introduce stop codons or different amino 
acids in a protein-coding sequence were rarely pursued, 
given the large number of variations in non-coding 
sequences, which were mostly invisible and untraceable 
before the availability of genome and transcriptome 
sequences.^v Even those that were confidently mapped 
outside of protein-coding sequences were routinely inter- 
preted as affecting cis-acting protein-binding sites that 
regulate nearby coding genes. Put simply, it was assumed 
that most disease-causing mutations occur in protein- 
coding sequences, where it was easy to identify them, or 
in cis-regulatory sequences that bind regulatory proteins, 
with scant knowledge of regulatory RNAs that might be 
expressed from the locus, some of which are now being 
identified.233.425-432 

In Drosophila, where careful genetic analysis 
identified many enhancers and other regulatory regions 
affecting development,^w the same interpretative bias 
occurred, despite the abundant evidence of differential 
expression of lncRNAs from these regions (Chapter 5), 
although there were exceptions, such as the roX RNAs 
identified by careful genetic and expression mapping 
(Chapter 9).433.434 

There were other exceptions, especially in farm ani- 
mals, where controlled breeding permitted accurate 
dissection of the genetic causes of quantitative trait 
variation. The mutation underpinning the ‘callipyge’ 
polar overdominance phenotype in sheep (Chapter 5) 
was mapped to a non-coding RNA expressed from the 
complex Dlk1-Dio3 imprinted region,*?-*5 as were others 
affecting quantitative trait variation.** Similar strong 
effects are observed for mutations in other lncRNAs (see 
below) and non-coding regulatory regions, as exemplified 
by the Crest mutation in chickens, which causes a spec- 
tacular phenotype in which the small feathers normally 
present on the head are replaced by much larger feath- 
ers normally present in dorsal skin. It was shown to be 
caused by a 197 bp duplication of an evolutionarily con- 
served sequence in the intron of HoxC10, which causes 
the ectopic expression of HoxC10 and other Hox genes, 
altering cell regional identity (Figure 13.4).4#° 

Indeed, many lncRNAs that are now known to be 
important went undetected or were overlooked in genetic 


v There are few promoters in the catalogs of mutations associated with 
genetic disorders,** although no one disputes their functionality. 

w An important subtlety, and difference between genetic screens in 
Drosophila and mammals, is that many if not most naturally occurring 
mutations and those experimentally induced in the latter (by 
the mutagen ENU) are single nucleotide mutations, which can have 
serious consequences on a protein-coding sequence, but often subtle 
consequences on regulatory sequences. By contrast, most experi- 
mental mutagenesis in Drosophila involved transposable element 
insertion or large deletions, which have more serious phenotypic 
effects on both coding and regulatory sequences, and hence it is no 
surprise that so many regulatory loci were unearthed in bithorax and 
other intensively studied gene regions in Drosophila. 
FIGURE 13.4 The ‘crest’ phenotype resulting from a 197bp duplication in the intron of HoxC10, which alters cell identity. 
(Reproduced from Li et al.,** under Creative Commons 4.0 license.) 


screens. These included nearly all miRNAs,^x many 
of which are not individually essential for viability or 
development;*? a conserved lncRNA (‘yar’) that lies 
within the Drosophila Achaete-Scute complex locus studied 
by Muller and others many decades ago, which was 
recently discovered to regulate sleep behavior; and 
the 3'UTR of the oskar gene that was unexpectedly found to 
function as a lncRNA controlling Drosophila oogenesis 
(Chapter 9).+44445 

Moreover, other lncRNAs initially thought to be 
non-essential, many of which display little ‘conser- 
vation’ and are lineage-restricted (including human- 
specific RNAs), have been implicated in disease,*#°-+9 
with more coming to light with the growing awareness 
of their relevance to complex traits.*%% These RNAs 


x It has been proposed that miRNAs operate in a hierarchical and 
canalized series of regulatory networks (see Chapter 15), a frac- 
tion of miRNAs acting at the top of this hierarchy, with their loss 
resulting in broad developmental defects, whereas most miRNAs are 
expressed with high cellular specificity and play roles at the periph- 
ery of development, affecting the terminal features of specialized 
cells.*#! It is likely that the same applies to IncRNAs. 


have been identified by chromosome breakpoints, 
fusions and translocations, deletions, copy number 
variations, point mutations, insertions and deletions, 
aberrant imprinting, other epigenetic defects and haplo- 
insufficiency, among others.^? 

Examples, some exhibiting Mendelian inheritance but
most involved in complex disorders, include DiGeorge
Syndrome (a range of symptoms including congenital
heart problems, unusual facial features, frequent infections,
developmental delay, learning problems and cleft palate);*2
other neurodevelopmental and craniofacial disorders;#3-*>
developmental defects, e.g., involving the lncRNA Chaserr,
which regulates the expression of a chromatin remodeler
implicated in neurological disease (Figure 13.5);*! limb
malformations, brachydactyly and other skeletal
abnormalities (Figure 13.6);%0456-%5 Angelman” and
Prader-Willi Syndromes;*9?-^9 schizophrenia; *07494-466 Kallmann


Y Therapies are being developed for Angelman's Syndrome by knocking
down the regulatory Ube3a-ATS non-coding RNA with antisense
oligonucleotides to restore expression of the normally silent (imprinted)
paternal Ube3a allele in patients lacking the maternal allele.*?

FIGURE 13.5 The severe phenotype resulting from haploinsufficiency of the lncRNA Chaserr. (Reproduced from Rom et al.*!
under Creative Commons CC BY license.)


FIGURE 13.6 Skeletal malformations due to the loss of lncRNA Maenli. (Reproduced from Allou et al.9? with permission from
Springer Nature.)


Syndrome (a subtype of gonadotropin-releasing hormone 
deficiency with a loss of smell),^9" pseudohypoparathyroid- 
ism;+% alcohol use disorders? nonalcoholic fatty liver 
disease;"? diabetes;"! multiple sclerosis; autoimmune 
thyroid disease;*7? Sjogren syndrome;* celiac disease; *^ 
Hypereosinophilic Syndrome;*” Kawasaki Disease;?! 
psoriasis;77^5 inflammatory bowel disease;*” athero- 
sclerosis;73!79480481 cardiac hypertrophy,**? Alzheimer’s 
Disease;* ataxias;*84485 myocardial infarction;*% and some 
types of thalassemia.!3!487488 

It has also been shown that phenylketonuria, one of
the first documented human genetic disorders, which is
mostly due to mutations in the enzyme phenylalanine
hydroxylase, can also be caused by perturbations in a
regulatory lncRNA and be modulated by administration
of modified RNA mimics in mouse models.^9?


As noted already, genomic regions associated with a
wide variety of complex disorders and characteristics,
including psychiatric traits and disorders and
neurodegenerative diseases, are replete with lncRNAs,??9.530,332.533
which are therefore candidates for the mechanistic basis
of the association. Sensibly, studies have started to focus
not only on non-coding regions, but also on mutations/
variations that affect RNA structure.“

Many lncRNAs have also been associated with the etiology,
progression, genomic instability and therapy resistance
of cancers,9049196 through altered expression,
insertional mutagenesis and/or naturally occurring mutations,
and functional validation of previously known and novel
RNAs that act as oncogenes’ or tumor suppressors, many of

* The lncRNA Cherub is required for the transformation of stem cells
into malignant cells.*”

which are enriched in repetitive elements.*% In some cases
(such as H19, PVT1, MIAT/Gomafu, OIP5-AS1/Cyrano,
TUG1,** HOTAIR, MEG3, XIST, TSIX, MALAT1 and
NEAT1), perturbations in lncRNAs are associated with
multiple cancers;735390494499-519 and in other cases with
particular types of cancers, including leukemias and
lymphomas,55520-55 melanoma,26327 osteosarcoma,?? gastric,
lung,?? breast? 5* prostate,!3533335 bladder? and many
others, including bone metastasis, with increasing
understanding of the mechanisms involved.505557-5%

A major problem is that the effects of many if not most mutations
in lncRNAs are cell type- and context-dependent,**% and not
evident in fish tanks or mouse cages? unless the animals are
subjected to specific challenges or behavioral assays. Indeed, in
Drosophila and C. elegans, both intensively studied, less
than a third of protein-coding genes have obvious phenotypes
when mutated,**15 and many apparently disruptive
mutations in human genes are common in the population,
indicating more subtle interactions.5%5% Targeted knockout
of the rodent-specific and highly expressed brain
lncRNA, BC1 (Chapter 8), yielded no obvious developmental
phenotype, but caused behavioral changes and
impaired experience-dependent plasticity and learning in
mice.>45>46 Deletion of the highly expressed and relatively
highly conserved lncRNAs Neat1 and Malat1, similar to
that observed with ultraconserved elements (Chapter 10),
did not result in dramatic developmental deficiencies,*+7548
but later analyses showed changes in behavior and placental
biology, as well as involvement in synapse formation,
myogenesis, cancers and responses to pathogen
infections.5%-55% The loss of another lncRNA, FosDT, which is
highly expressed in the cerebral cortex and interacts with
chromatin-modifying proteins associated with the neuronal
transcription factor REST, causes no developmental
anomalies but reduces brain damage from strokes.554555
The lncRNA Pnky regulates neuronal differentiation and
its deletion affects postnatal cortical development, although
the mice do not exhibit superficial defects.556557

On the other hand, deletion of other lncRNAs in mice
by Rinn and colleagues, with visual markers that revealed
exquisite expression patterns, resulted in a range of more
obvious phenotypes including homeotic transformations,
skeletal and neuronal abnormalities, heart and
gastrointestinal defects, muscle wasting, abnormal lung
morphology and aging.?*5* Many more have since been reported
using various in vivo approaches.^^?

High-throughput siRNA and CRISPR reverse genetic 
screens combined with molecular phenotyping3% are now 


a TUG1 is required for mouse retinal differentiation?* and male
fertility.?*

increasing the search speed, identifying, for example,
lncRNAs that are involved in chromatin interactions,?6!
required for heart development,* regulate nuclear factor
trafficking,*% activate resistance to BRAF inhibitors
in melanoma,*%* respond to Wnt signaling,? sensitize
glioma cells to radiation and are essential to cancer cell
viability,* are involved in cell growth and migration,560%5%
spermatogenesis*% or lung cancer,“ or have various fitness
effects.57 A recent large-scale study using CRISPR mutagenesis
of over 16,000 lncRNAs in seven cell lines identified
almost 500 required for normal cellular proliferation,
89% of which were expressed in only one cell type.?”!

A systematic high-throughput loss-of-function analysis
of 248 protein-coding genes and 141 lncRNAs using
the fission yeast S. pombe, assessing mutant growth and
viability in “benign” and 145 variable conditions, showed
that phenotypes are found much more frequently for the
former compared to the latter (47.5% for the lncRNAs
and 96% for the protein-coding genes). However, on more
careful inspection (also evaluating the effect on cell-size
and/or cell-cycle control), 59.6% of lncRNAs yielded
phenotypes and, upon overexpression of the lncRNAs under
47 different conditions, 90.3% led to altered growth under
certain conditions. These results reinforce the notion that
most of the lncRNAs exert cellular functions in specific
environmental or physiological contexts.???

Intriguingly, it appears that some gene deletions
may be masked by compensatory mechanisms, whereas
acutely disturbing transcript levels may have more severe
effects." There is also evidence that ‘shadow’ enhancers
(which express and likely operate through lncRNAs,
Chapter 16) provide redundancy and robustness to
developmental programs (Chapter 15)?7- 5 which makes
sense given the criticality of the process for survival and
reproductive success. On the other hand, knockdown of
lncRNA expression in culture often has visible effects,
in terms of changes in cell shape, behavior and gene
expression profiles,?759? which may have stronger
manifestations in artificial in vivo settings such as xenograft
models.565


AN AVALANCHE OF LONG 
NON-CODING RNAs 


With the popularization of unbiased transcriptomic studies,
growing numbers of non-coding RNAs have been
identified and studied in model organisms, human cell
lines and disease systems, mainly ad hoc by the differential
expression of intronic, intergenic, pseudogene-derived
and antisense lncRNAs, but also increasingly by

FIGURE 13.7 The increase in the number of publications that have the terms ‘long/large non(-)coding RNA’ or variations
thereof (lncRNA or lincRNA) in their PubMed entry.


functional screens. The ENCODE project alone found
over 850 pseudogenes that are "transcribed and associated
with active chromatin",^? with many since
shown to function as regulatory RNAs. 0522579584

In addition to viroids that have small circular genomes
(Chapter 8), circular RNAs (circRNAs) occur in plant and
animal cells, one of the first discovered being a variant
of the Sry mammalian male sex-determining gene
transcript,>* as ever considered an interesting oddity at the time.
CircRNAs remained under the radar because traditional
cloning and sequencing protocols and informatic methods
militated against their detection. They are predominantly
produced by back-splicing facilitated by reverse-complementary
sequences (often recently acquired and transposon-derived)
that promote pre-mRNA folding.* Since their
rediscovery, circRNAs have become recognized as a bona
fide class of functional RNAs, in many cases comprising
the dominant transcript isoform.152.54587588

CircRNAs are predominantly nuclear?" and act through 
several mechanisms, having been shown to regulate, 
inter alia, transcription, immune responses, behavior, 
neural cell function and pluripotency,*9???* Drosophila 
lifespan??^ and centromeric chromatin organization in 


a Containing exons,* or a mixture of intronic and exonic sequences.5% 
Some are derived entirely from introns, and have been shown to reg- 
ulate their parent protein-coding genes.^?! 


maize.?* Many circRNAs have regulatory interactions
with the cognate protein-coding gene, as illustrated by
the neuronal-enriched, psychiatric disease-associated
circHomer1 RNA and its host gene Homer1. Based on in
vitro and in vivo studies with mouse models, they were
found to be functionally antagonistic in synapses of the
orbitofrontal cortex, with this opposing interplay
regulating synaptic gene expression, cognitive flexibility and
behavioral performance, with potential relevance for
brain function and psychiatric diseases.??^

Over the past decade, there have been ~50,000
publications with long non-coding RNA as a key term
(Figure 13.7) and over 2,000 publications reporting
validated long non-coding RNA functions.*” These studies
have been assisted by the systematic cataloging and
annotation of lncRNAs by the GENCODE consortium,* The
Cancer Genome Atlas (TCGA)?6 and the extension of the
FANTOM projects to transcriptomic atlases, associated
Web resources and functional annotation of lncRNAs.*4

There are now hundreds of thousands of cataloged
lncRNAs and dozens of databases (and databases
of databases) with curated information.9? Well over


ae Initially formed as part of the pilot phase of the ENCODE project???
but expanded to annotate "human and mouse genes and transcripts
supported by experimental data with high accuracy, providing a
foundational resource that supports genome biology and clinical
genomics",177336.509

100,000 human lncRNAs have been recorded,%! many
of which are specific to the primate lineage,%5%%% including
retrovirus-derived lncRNAs,9? and this is a vastly incomplete
catalog due to the still limited analysis of different cells
at different developmental stages and physiological
conditions.


A PLETHORA OF FUNCTIONS 


LncRNAs have been shown to regulate many aspects
of mammalian development, cell differentiation
(Figure 13.8), physiology and brain function6056% (see
Table and Chapter 17), as well as to play many other roles in other
organisms, 4607-61? including the translation of Doublesex
in Drosophila, ^ female honeybee development,*% plant
vernalization (see below), strawberry fruit ripening,*!!
DNA elimination and genome rearrangements in ciliate
life cycle and reproduction;?? the mitotic to meiotic
switch?? and meiotic chromosomal pairing in yeast, 6146/5
and carotenoid biosynthesis in filamentous fungi.

Mechanistic details have started to emerge, some
serendipitously. As with the discovery of the 7SL RNA in
signal recognition particles in the early 1980s (Chapter 8),
biochemical assays used to identify protein interactions
detected other regulatory RNAs, such as the identification
by yeast two- and three-hybrid screens of mammalian
SRA RNA as a transcriptional coactivator, Gas5 RNA
as a repressor of glucocorticoid receptor activity, and the
plant ENOD40 RNA as a regulator of the localization of
an RNA-binding protein in cytoplasmic granules./38741
Other lncRNAs have been found to associate with cell
membranes to alter their permeability and dynamics,
thereby modulating signal transduction and transport
pathways,“ including the reprogramming of glucose
metabolism.”

Functionally characterized examples have established
that lncRNAs participate in virtually all levels
of genome organization and gene expression, via
RNA-RNA, RNA-DNA and RNA-protein interactions,
often involving repeat elements within them, including
SINEs in 3'UTRs. These encompass the regulation
of transcription, chromatin architecture and the
organization of subcellular domains (see Chapter 16), control
of protein translation and localization,361;563;661,748-750
splicing?$.65:5179 and other forms of RNA processing,
editing, localization and stability.76?767

Some of the first characterized non-coding RNAs were
found to regulate transcription by modulating RNA
polymerase II activity, directly (such as RNAs from B2 SINE
and Alu elements) or indirectly through interaction with
transcription factors.7576 7SK, for instance, acts primarily
by sequestering and inactivating the positive transcription
elongation factor b (P-TEFb), a heterodimer composed of
cyclin-dependent kinase 9 (Cdk9) and cyclin T1, which connects
transcription to the cell cycle and chromatin
architecture.770-777 It controls stress-induced transcriptional
reprogramming”” and regulates several aspects of the expression
not just of mRNAs, but also of snRNAs, bidirectional and


Examples of Functions of lncRNAs in Mammalian Biology

Stem cell pluripotency, self-renewal, lineage commitment and reprogramming?0315.617-629
Cell cycle,%5 proliferation and migration560;567,636-6%0
Brain evolution,” neocortex, forebrain,9^^9 and retinal198,646,647 development
Synaptic plasticity and function®>°°
Mammary,*7 sclerotome,6% heart, lung,°”° skeletal,457.455 limb+0 and intestinal”! development
Liver regeneration6?!.6??
Angiogenesis®* and fibrogenesisó%
Hematopoiesis,” granulocyte,** megakaryocyte,*!® T-cell^7! 472.700 and keratinocyte”! differentiation
Innate and adaptive immune responses/0975
Microbial susceptibility, endotoxic shock and immunity* 4717-721
Growth hormone and prolactin production???
Testis development and spermatogenesis!?4265.731733
Thermogenic adipocyte regulation?
Epithelial-mesenchymal transition?!3536,630,631
mesoderm and endoderm differentiation$??-65
Cellular senescence?** and apoptosis?79.476641
Maintenance of neural progenitor cells,*% neuronal differentiation, ?96607.048-650 outgrowth and regeneration, 97.651.652 axon integrity,° myelination*%*
Memory,°°! sex-specific depression and social hierarchy®*
Muscle differentiation and function,?%072-68 myogenesis and muscle fiber type switching9?!.655-690
Cholesterol biosynthesis and homeostasis */0693
Formation of vascular endothelial cell junctions?7695
Erythropoiesis and developmental regulation of globin gene expression311:704-707
Inhibition of viral replication!
V(D)J and Ig class switch recombination’?
inflammation and neuropathic pain *?9475.479.7247728
Glucocorticoid resistance???
DNA damage repair?! 774755
Mitochondrial function???

FIGURE 13.8 Control of cell differentiation and self-renewal by lncRNAs. (Reproduced from Flynn and Chang with
permission of Elsevier.)


enhancer RNAs,” and acts as a multi-functional RNA
scaffold that regulates neuron homeostasis.780

Genomically associated RNAs have been shown to
regulate gene expression by other mechanisms. Transcripts
spanning the cyclin D1 (CCND1) regulatory promoter
sequences recruit and allosterically regulate the TLS
RNA-binding protein and induce chromatin modification
to repress CCND1 expression.?! At the DHFR locus, a
non-coding RNA initiated from an upstream minor
promoter forms a stable RNA-DNA triplex within the major
promoter to repress DHFR expression."? Likewise, among
others,'8 the lncRNAs ANRIL (CDKN2B-AS1), Khps1
and CISAL regulate expression of the cyclin-dependent
kinase inhibitor CDKN2B, the proto-oncogene SPHK1
and the tumor suppressor BRCA1, respectively, via
triplex-mediated changes in chromatin structure.*81,784-786
Non-coding RNAs from intergenic spacers and
promoter regions of rDNA genes in humans establish
and maintain heterochromatin structure at specific
rDNA promoters via the recognition of RNA
secondary structures and formation of triplexes with
target DNA sequences that recruit DNA methyltransferase
DNMT3b and more.!%-15! Indeed, lncRNAs play
central roles in the formation and function of
heterochromatic domains, including telomeres?! and
centromeres, in all eukaryotes.

Many lncRNAs associate with enzymes and complexes
that impart histone modifications and DNA
methylation.97997? Both small and long non-coding
RNAs control the target specificity of and the
interplay between repressive Polycomb group (PcG) and

ad Maintenance of telomeres, in addition to the telomerase RNA 
component TERC (Chapter 9) involves transcripts named TERRA 
(telomeric repeat-containing RNAs) transcribed from subtelomeric 
regions in a developmentally regulated fashion. TERRA RNAs con- 
tain repeat rich sequences that form G-quartet structures and regu- 
late local heterochromatin stability, telomerase activity and telomere 
length, as well as biological processes such as the induction and 
maintenance of pluripotency.7%7% Although first discovered and 
characterized in mammalian cells, analogous RNAs have similar 
functions in different organisms, including fungi.?24795 




activating Trithorax group (TrxG) protein complexes 
(Chapter 16), chromatin modifiers that maintain silent 
and active expression states of genes during develop- 
ment?9.178.796,798-800 (Chapter 14). 

For example, it has been shown that the mouse
chromodomain-containing PRC1 component Cbx7
binds RNA and that its association with the inactive
X chromosome depends on interaction with RNA.*?!
Imprinting of loci by lncRNAs such as Air and
Kcnq1ot1 (Chapter 9) and X-chromosome epigenetic
silencing by Xist-locus derived RNAs involve recruitment
of PcG and other chromatin-modifying complexes
(Chapter 16). The ~3.8kb lncRNA ANRIL, some of
whose exons are primate-specific,** is transcribed from
a GWAS region associated with autoimmune and other
disorders”?!475426.429 and recruits components of the
Polycomb Repressor Complexes 1 and 2 (PRC1/2) to
epigenetically silence the INK4B/ARF/INK4A tumor
suppressor cluster.$%2-804 Exon 8 of ANRIL, which is mainly
comprised of repeat elements, mediates ANRIL’s
association with target loci to modulate their expression
through H3K27me3 deposition.’ The lncRNA Chaer
controls hypertrophic heart growth by binding to and
inhibiting the function of PRC2.674

Expression of the genes in the Hox clusters is controlled
by enhancer elements present in intergenic
regions that bind regulatory proteins, which are thought
to activate nearby protein-coding genes in cis but that are
also co-linearly transcribed into non-coding RNAs during
development.5%-512 Hundreds of lncRNAs associate
with PRC2 complexes, including functionally validated
lncRNAs such as TUG1, Meg3/Gtl2 and HOTAIRSI5-815
(Chapter 16).

HOTAIR is a ~2.2kb spliced RNA transcribed from
the HOXC locus, antisense to the flanking genes HOXC11
and HOXC12, which was originally shown to direct
heterochromatin formation in trans across a 40kb domain
of the HOXD cluster in human fibroblasts.? HOTAIR
was later shown also to influence gene expression at other
sites around the genome by recruitment of PRC2 and
LSD1/CoREST/REST repressive chromatin-modifying
complexes.5/281657 As with Xist, HOTAIR has different
functional domains, with a 5' domain that binds PRC2
and a 3' domain that binds LSD1, and has been proposed
to act as a scaffold for protein complexes,$16818 likely a
general function of lncRNAs*995!? (Chapter 16).

Other intergenic or antisense lncRNAs transcribed
from homeotic loci (Evx1, HoxA13 and HoxB5/B6 and
many others; see Chapter 16) have been shown to bind
to TrxG;?993*! the spliced lncRNA HOTTIP (~3.8 kb)
is transcribed from a region immediately downstream
of the human HOXA13 gene, interacts with the TrxG
MLL component WDR5 and directs the complex to
activate HOXA13 and additional neighboring HOXA
genes by a mechanism that involves chromosomal
looping.2%

In plants, lncRNAs also control many aspects of
development and environmental responses,5% exemplified by
the lncRNAs COOLAIR, transcribed antisense to the
major repressor of flowering, FLOWERING LOCUS C
(FLC), and COLDAIR, transcribed from the first intron
of FLC, which mediate cold-induced epigenetic
repression (‘vernalization’) of flowering time.?! COOLAIR and
COLDAIR contain conserved modular secondary
structures and act by recruitment of PRC2 and other
epigenetic regulators, RNA-DNA R-loop formation, chromatin
looping and the formation of phase-separated
condensates.82-834 A distal COOLAIR variant sequesters TrxG
into condensates away from the promoter*? and unspliced
COOLAIR forms “clouds” around the locus,$% similar
to Xist (see Chapter 16). COOLAIR is also differentially
spliced in Arabidopsis variants adapted to different
climes.837838 Another lncRNA, FLAIL, represses
flowering time.5??

These are prominent examples among recurrent themes
for lncRNAs, incorporating their functions as scaffolds,
epigenetic guides, chromatin organizers and control
devices, allosteric regulators and ribozymes, as well as
decoys that sequester regulatory factors,?7.75,500,819,840,841
acting as ‘target mimics’, ‘miRNA sponges’ or ‘competing
endogenous RNAs’.611.620,842,843


THE WILD WEST 


In the ‘Insights of the Decade’ section of the special issue 
of Science magazine in 2010, less than 10 years after the 
publication of the draft human genome sequence, it was 
noted that 


Many mysteries about the genome's dark matter 
are still under investigation. Even so, the overall 
picture is clear: 10 years ago, genes had the spot- 
light all to themselves. Now they have to share it 
with a large, and growing, ensemble.5^ 


Indeed, it progressively became evident that lncRNAs are
ubiquitously involved in differentiation and development
processes in eukaryotes (Figure 13.9).

A recent review observed


"The prior widely held perception that they are
predominantly junk [should] also [be] factored
in" to the analysis of such experiments and that
[although] “there have since been more than a
thousand publications on the functions of these
lncRNAs, both in cis and trans, many (molecular
biologists) are still only aware of the earlier
dismissive publications”.5%


FIGURE 13.9  Depictions of eukaryotic RNA regulation in 1994 and less than 15 years later. (a) At a time when the
understanding of genetic information was still largely based on bacterial studies (upper left), proteins were thought to perform all the
functions in cells, including regulation (arrow 1, black) in eukaryotic cells, with speculation about a possible role of RNA in
gene regulation represented as a question mark (arrow 2, red). (Reproduced from Nowak*** with permission from the American
Association for the Advancement of Science.) (b) After the turn of the century, many additional examples of regulatory RNAs
emerged and were shown to act at many levels of gene expression in the nucleus and cytoplasm of eukaryotic cells. (Reproduced from
Amaral et al.'8° with permission from the American Association for the Advancement of Science.)


Most lncRNAs still remain experimentally untouched or
poorly characterized, such as LINC02476 (GenBank
CB338058), composed of at least five exons spanning
288 kb, found in 2003 associated with autism, in patient
breakpoints that disrupt this non-coding RNA
transcript, but only now being studied due to its differential
expression in cancer.5%8

The bottom line is that these highly important regula- 
tory RNAs were present all along but, despite the cases 
documented in the closing decades of the 20th century 
and many more detected by transcriptomic, enhancer 
‘traps’ and other biochemical and functional screens in 
the first decade of the 21st century, they were not gener- 
ally, until recently, taken seriously. They have also been 
studied in all sorts of ways, with all sorts of specific and 
general hypotheses, dubbed “The Wild West" by Jeannie 
Lee”? and “The Noncoding RNA Revolution" by Tom 
Cech and Joan Steitz.5^? 


As Stent noted in relation to the unexpected
discovery that DNA is the genetic material, the finding
of dynamic and differential genome-wide transcription
of intergenic, antisense and overlapping lncRNAs,
like the related discovery of intervening sequences,
was "premature" in the sense that it could not be
readily incorporated into the existing conceptual fabric. To
put the plethora of regulatory RNAs into full perspective
and to integrate them into a contemporary framework
for the genetic programming of complex organisms, we
must first consider the epigenome and the amount of
information required for multicellular development.

14 The Epigenome


Early studies in Drosophila and other organisms showed 
that the patterns of gene expression vary in different cell 
types, which define their identity and fate, and that these 
patterns can be maintained following DNA replication 
and subsequently through mitosis. That is, there is a sec- 
ondary form of genomically encoded heritable informa- 
tion, termed 'epigenetic' information, which is embedded 
in chromatin modifications and manifested as canalized 
pathway choices during differentiation and development, 
first proposed by Conrad Waddington in the 1940s.? 

The developmentally regulated packaging of eukary- 
otic DNA into compacted heterochromatin? and more 
transcriptionally active euchromatin had been known 
since the early 20th century, with different regions of the 
genome thought to be open or closed for business, akin to 
a library compactus.5? 


CHROMATIN STRUCTURE 


In 1974, Ada and Donald Olins'? and Roger Kornberg!!!” 
reported that eukaryotic chromatin appears like "linear 
arrays of spheroid units” or “beads-on-a-string”, respec- 
tively, and that the DNA is wound like cotton around a 
spool into 11 nm diameter ‘nucleosomes’, which contain 
four pairs of histones.!! The Olins also credited another 
investigator, Christopher Woodcock, who had obtained 
similar images. Woodcock's paper? was, however, 
rejected by the journal Nature, a reviewer asserting that 
to accept the article would require “rewriting our text- 
books on cytology and genetics" and that “such a naive 
paper ... should not be published anywhere”.!* 

It was known from the 1960s from the work of Vincent 
Allfrey, Alfred Mirsky and others that histones can be 
methylated or acetylated, sometimes in response to exter- 
nal stimuli, and that these modifications affect transcrip- 
tion,^-" although the extraordinary range of histone 
modifications was not apparent until much later. 

Pluripotent cells have relatively open chromatin, as 
do cancer cells, whereas the extent of closed chroma- 
tin increases as cells differentiate.*/* Nucleosomes in 


There are two types of heterochromatin: facultative heterochromatin,
which is developmentally regulated (such as occurs in X-chromosome
inactivation and at many other discrete loci during differentiation
and development) and constitutive heterochromatin (such as occurs
in centromeric and telomeric regions of chromosomes).

heterochromatin are compacted into higher-order structures,
initially described as 30nm fibers, but the exact
nature of these structures remains controversial.!?-22
Chromatin is further compacted during meiosis and 
mitosis.2%2 While the mechanisms controlling chro- 
matin condensation and decondensation are not well 
understood, it is clear that histone modifications and 
non-histone proteins play important roles.?^?* Moreover, 
in all eukaryotes — from yeast to plants and animals — 
RNAs have been shown to be associated with chromatin, 
degradation of which by RNase changes the patterns of 
exposed DNA ???! 

The fine-scale organization of the eukaryotic nucleus,
chromosomes and chromatin becomes more elaborate
with increased developmental complexity, as documented
by Torbjörn Caspersson, Julie Korenberg, Mary
Rykowski, Giorgio Bernardi, Wendy Bickmore and others,
who also showed that cytological ‘banding’ patterns,
gene density, intron density, protein density, GC content,
CpG island and repeat distributions vary widely across
chromosomes.??-40

The classical banding patterns correlate with the
distribution of repeats. In human chromosomes Alu
elements are concentrated in the so-called Reverse or
R-bands, especially in the T-bands, the most intensely
stained and most GC-rich fraction of the R-bands. LINE1
elements are concentrated in the alternating Giemsa or
G-bands?253?-! and sequester genes with specialized
functions in the nucleolus and inactive lamina-associated
domains (see below), indicating a global role of
transposable elements in orchestrating the function, regulation
and expression of their host genes.?


TOPOLOGICAL DOMAINS 


Fluorescence in situ hybridization studies by Thomas and
Marion Cremer, Bickmore and others from the 1990s
showed that chromosomes occupy defined 'territories'
in the nuclei of animal and plant cells?-^* (Figure 14.1),
confirming the conclusions drawn by the cytogeneticists
Carl Rabl and Theodor Boveri a century before.*%-3!
These studies, refined and expanded by new techniques,
also revealed radial segregation of chromosomal
domains: gene-rich and actively transcribed chromosomal
regions are located in the center of the nucleus,
whereas gene-poor and genetically quiescent heterochromatic
regions are sequestered at the periphery, associated
with the nuclear membrane.*751-56

FIGURE 14.1 Chromosome territories (CTs) in the chicken fibroblast nucleus. (Reproduced from Cremer and Cremer** with
permission of Springer Nature.)

Job Dekker and colleagues showed that, in both
animals and plants, euchromatic and heterochromatic
regions are partitioned into megabase-sized active 'A'
and inactive ‘B’ compartments, respectively,” which
encompass smaller three-dimensional ‘topologically
associated domains’, or TADs, with high-frequency
intra-chromatin interactions.225771

A striking example is the discovery by Elphége Nora,
Dekker, Edith Heard and colleagues that the X-inactivation
center (XIC), which they failed for decades to define using
cloned transgenes of up to 500kb, spans bipartite TADs>
that occupy ~800kb of genomic territory: the promoter
for the Xist gene, which triggers X inactivation, lies in
one TAD of ~500 kb, whereas its antisense regulator Tsix
lies in another TAD of ~300kb.7%7 They also proposed
that TADs underlie many properties of the long-range
transcriptional regulation that occurs in animals and
plants,976 a prediction that coalesces with later observations
that subnuclear and subcellular compartment organization
is at least partly driven by RNA-mediated phase
separation??? (Chapter 16). The topological organization
of chromatin during development is also reliant on repetitive
elements and their interaction with the heterochromatin
protein 1 (HP1) family.*!-83

TADs have an average size of ~0.5—1 Mb, shown by 
proximity ligation (cross-linking the DNA in situ to iden- 
tify sequences physically adjacent in three-dimensional 
space), with higher resolution analyses revealing finer 
scale internal TAD organization.?^*?^ TADs appear 
to demarcated by boundary regions anchored by the 
‘insulator’ protein CTCE and the 'cohesin complex 
(Figure 14.2), which interact and control chromatin loop 
extrusion,9^5! involving phase-separation,?? evident as 


^ The structure of these domains on the X chromosome and others on 
autosomes is regulated by the IncRNAs Dxz4 and Firre." 


* There is also conflicting data, with other studies showing a poor 
correlation between CTCF binding sites and TAD boundaries. 
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"architectural stripes", where loop anchors secure topo- 
logical domains and link enhancers (see below) to cog- 
nate promoters. A cell lineage-specific subset of CTCF 
binding sites? and TAD boundaries are controlled by 
DNA methylation, indicating an interplay between epi- 
genetic modifications, chromatin organization and tran- 
script isoforms during development.?^-100 

CTCF is also associated with attachment to the nuclear
lamina, a filamentous protein network underlying the
nuclear membrane in animal cells, where it demarcates
‘lamina-associated domains’ (LADs). LADs have low
transcriptional activity, consistent with the earlier
observations that gene-poor and quiescent genomic regions
are located at the nuclear periphery. The composition of
the nuclear lamina varies in different tissues, and
mutations in lamina proteins result in a range of
conditions including muscular dystrophies and neurological
disorders. Lamin-like proteins also occur in plants and
dynamically tether heterochromatin to the nuclear
periphery in response to environmental and developmental
signals.

Vertebrate genomes are also partitioned into
‘isochores’, megabase-sized domains of different G+C
content, which are most pronounced in mammals. Isochores
may correlate with TADs and LADs, with the G+C
distribution apparently playing a role in “moulding”
chromatin accessibility, although the relationship is
unclear.

The number of LADs, TADs and replication domains
(~2,000) in the human genome is similar to the number of
chromosome bands observed in prometaphase chromosomes.
TADs and TAD boundaries also correspond with the bands
and inter-bands seen on Drosophila polytene chromosomes,
as well as with ‘chromomeres’ (locally coiled chromatin
domains observed in mitotic and meiotic prophase
chromosomes), a correspondence supported by the
observation that TADs are condensed chromatin domains
separated by regions of active chromatin.

Some reports suggest that TADs are stable across
evolution and cell types, are independent of gene
expression, and may represent DNA replication
modules,85,111-114 whereas others indicate that TADs, and
to a lesser extent A and B compartments, vary among cell
types and are reorganized during differentiation and
development. TADs may be equivalent to the chromatin
domains formed by enhancer action (see below). TADs in
human pluripotent stem cells are demarcated, at least in
part, by transcriptionally active HERV-H retrotransposons
and regulated by the RNAi pathway via AGO1 association
with expressed enhancers.118 Some evidence suggests
that megabase-scale TADs are largely cell-type invariant,
whereas ‘subTADs’ reconfigure in a cell type-specific
manner. TAD reconfiguration at the HoxD locus appears to
regulate limb development, and cell-type specialization
is encoded by chromatin topologies.

* Many CTCF binding sites are derived from transposable
elements.

* DNA methylation also regulates alternative
polyadenylation via CTCF and the cohesin complex.

TADs are also reorganized in response to physiological
parameters, such as hormone signaling and neuronal
activation.120-122 They are also the functional units of
the DNA damage response, required for the one-sided
cohesin-mediated loop extrusion of chromatin domains
containing the double-strand break-specific histone
variant, phosphorylated H2A.X (see below), a process that
involves transcription of non-coding RNAs. Mutations
affecting TAD boundaries are associated with human
developmental disorders and cancers, apparently due to
aberrant promoter-enhancer interactions.124,125


ENHANCERS 


‘Enhancers’ are upstream, downstream or intronic
non-protein-coding genomic regions in animals and (to a
lesser extent) plants that control developmental
cell-type-specific spatiotemporal expression patterns of
protein-coding and non-protein-coding genes in their
neighborhood, by altering the organization of chromatin.
Enhancers can be located hundreds of kilobases
away from their target genes and are (local) position and
orientation-independent.126,134-140

Enhancers were classically recognized and genetically
defined by their developmental effects, rather than their
biochemical properties or mode of action. Although not
described as such, enhancer activity was first observed in
the bithorax complex of Drosophila, but it was only
in the early 1980s that the term was coined to describe
the unexpected ability of SV40 viral DNA sequences
to increase the expression of a cloned β-globin gene.
Many tissue-specific enhancers, often containing
repetitive elements similar to those in viral enhancers,*
were subsequently identified in mammalian immunoglobulin
and globin gene** loci, as well as in the Drosophila
bithorax complex and other genes that show restricted
expression patterns during development (Figure 14.3),
initially using deletion strategies.134,135,146-151

* Endogenous retroviruses have been shown to be a source
of enhancers.

** Globin enhancers were originally and are still often
referred to as ‘locus control regions’.

Many enhancers have since been identified by other
genetic and bioinformatic approaches, the former
using insertions of transposons with reporter genes,
called ‘enhancer trapping’, in Drosophila, plants and
vertebrates. More recently attempts have been made to
characterize known enhancers and identify others by
genome-wide analysis of the binding positions of presumed
signature proteins (the ‘transcriptional co-activators’
P300 and Mediator*) combined with the presence of
correlated histone modifications, the presence of
nucleosome-depleted regions and/or the expression of
‘enhancer RNAs’ (eRNAs), approaches that yield different
prediction sets and blur the distinction between
enhancers and (protein-coding) gene promoters (see below
and Chapter 16).

* P300 is a histone modifying enzyme. Mediator is a highly
modular multi-subunit complex that appears to connect
distal transcription factors with the transcription
initiation machinery. Both P300 and Mediator bind RNA,
which is required for their chromosomal localization, TAD
juxtaposition and local chromatin modification
(Chapter 16).

FIGURE 14.2 The structural features of topologically
associating domains. (a-d) Heat-map representations (top)
and schematized globular interactions (bottom) of TADs
(a,b) and nested subTADs (c,d). (e) Cartoon representation
of different classes of contact domains parsed by their
structural features and degree of nesting. (f)
Identification of contact-domain classes from e in
cortical neuron Hi-C data, binned at 10-kb resolution. (g)
Cohesin translocation extrudes DNA in an ATP-dependent
manner into long-range looping interactions that form the
topological basis for TAD and subTAD loop domains. (h-k)
Contact frequency heat maps of high-resolution Hi-C data
from embryonic stem cells (ESC; h,j) and neural progenitor
cells (NPC; i,k). (h,i) Green arrows denote the corners of
a subset of the nested chromatin domains evident in this
genomic region. (j,k) Green arrows annotate a
high-insulation-strength, cell-type-invariant TAD
boundary. Blue arrows point to a lower-insulation-strength,
cell-type-dynamic subTAD boundary. (Reproduced from Beagan
and Phillips-Cremins with permission of Springer Nature.)

FIGURE 14.3 Restricted expression patterns of embryonic
enhancers at two different developmental stages visualized
by lacZ expression in transgenic Drosophila embryos.
(Reproduced from Stathopoulos et al. with permission of
Elsevier.)

The appearance of enhancers has been linked to the
emergence of animal multicellularity and phenotypic
diversity, neuronal expansion in vertebrates and
the recent evolution of primates. Positive selection
for nucleotide changes in enhancers has contributed, for
example, to the uniquely human aspects of thermoregulation
(sweat glands in the skin) and digit and limb
patterning, including the increase in size and rotation
of the thumb toward the palm for enhanced dexterity.
Body plan specification is controlled by multiple
enhancers to ensure precise patterns of gene expression.
Clusters of enhancers, such as those at the beta globin
locus, but also many others, have been dubbed
“super-enhancers”, “stretch enhancers” or “enhancer
jungles”.127,188-193 Enhancers also play a role in the
etiology of cancer, and disruptions of chromatin
topological domains cause rewiring of gene-enhancer
interactions with pathogenic consequences.

Enhancers are still incompletely defined, physically
and conceptually, but have been described as
“DNA logic gates”. Mechanistically, enhancers were
originally conceived as clusters of transcription factor
binding sites that are brought into contact with target
protein-coding gene promoters by long-distance DNA
looping, a model first proposed by Mark Ptashne to
reconcile enhancer function with transcription factor
control of gene expression. The persistence and vagaries
of the initial interpretation of how enhancers
work124,128,131,136,169,199 have been referred to by Marc
Halfon* as a case of “founder fallacy” and “validation
creep”, by no means the first in molecular biology or
science generally.

* Halfon notes, for example, that “a recent paper
erroneously states that enhancers ‘were first described as
nucleosome-depleted regions with a high density of
sequence motifs recognized by DNA-binding transcription
factors’”.

There is good evidence that enhancer action leads to the
juxtaposition of distal chromosomal sequences in
three-dimensional space, and to consequent transcriptional
activation of genes in their orbit. Enhancer-mediated DNA
looping may be equivalent to TADs,131,201 but enhancers
can exert their action across TAD boundaries, which may
in turn play a role in mediating formation, reorganization
and/or juxtaposition of such domains, although genome
topology and gene expression can be uncoupled. Enhancers
also recruit histone-modifying chromatin remodeling
proteins, such as the CREB-binding protein (CBP, see
below).

However, evidence for the direct interaction of
transcription factors bound at enhancers with target
protein-coding gene promoters is limited and in some cases
contradictory, and intimate contact may be more an
enduring presumption than an accurate mechanistic
description, especially in view of the fact that enhancers
are transcribed in the cells in which they are active.
Indeed, enhancers have many if not all of the
characteristics of bona fide genes, including promoters.
Most lncRNAs originate from enhancers and enhancer
RNA production is considered the most reliable indicator
of enhancer action. How enhancers select their targets is
unknown, but likely involves RNA-DNA, RNA-RNA and
RNA-protein interactions (Chapter 16).

Strikingly, the number of mammalian enhancers, estimated
to be in the hundreds of
thousands,130,170,172,180,192,216-219 far exceeds the
number of protein-coding genes, which indicates that
distal sequences that regulate developmental expression
patterns occupy a much larger fraction of the genome than
those constituting the proximal promoters of
protein-coding genes.


NUCLEOSOMES AND HISTONES 


Partial digestion of exposed DNA in chromatin with
micrococcal nuclease yields a ladder of modal DNA
lengths in multiples of ~180 bp, reflecting 147 bp of DNA
supercoiled around the outside of the nucleosome core
particle and ~35 bp of linker DNA between (in mammals),
although the average length of the linker sequence varies
between species and cell types.

There are approximately 30 million nucleosomes in
a human cell. Canonical nucleosomes are composed
of an octamer of four small, highly basic proteins:
histones H2A, H2B, H3 and H4. The central H3-H4 tetramer
is sandwiched between two H2A-H2B dimers, and the
N-terminal tails of the histones protrude beyond the DNA
shell and are the major sites for post-translational
modifications (Figure 14.4) (see below). Canonical
histones are produced during the replicative S-phase of
the cell cycle and are among the most highly conserved
proteins in evolution. Interestingly, the genes encoding
the canonical (but not the variant) histones are some of
the few genes that lack introns, possibly because their
constitutive production with chromosomal replication does
not require efference signals to be transmitted in
parallel.

FIGURE 14.4 (a) Nucleosome structure showing histone
octamer core, encircling DNA and protruding histone tails.
(Reproduced from Luger et al. with permission of Springer
Nature.) (b) Some of the many modifications of histone
N-terminal tails. (Reproduced from Zhao et al. with
permission of Springer Nature.)

Archaeal histones form a structure similar to the
eukaryotic H3-H4 tetramer, but, unlike eukaryotic
histones, lack extended N-terminal tails and
post-translational modifications. Both possess a copper
(Cu2+) binding site at the H3 dimerization interface and
have been shown to have copper reductase activity,
suggesting that they originated as a mechanism for copper
utilization under oxidizing conditions.

Another histone, H1, binds to the outside of the
nucleosome at the entry and exit sites of the DNA to
stabilize the particle and/or play a role in coiling of
nucleosomes into higher-order structures. A homolog of
histone H1 exists in bacteria, and also appears to have
been acquired by eukaryotes at the time of their origin.
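
The micrococcal nuclease ladder described at the start of this section is simple arithmetic: each band corresponds to a whole number of nucleosome repeats, where one repeat is the core-particle DNA plus one linker. The sketch below is only an idealized illustration of that relationship. The function name and the alternative linker length in the usage note are hypothetical, and real digests give broader, trimmed bands rather than exact multiples.

```python
# Idealized nucleosome ladder: band k contains k nucleosome repeats.
# CORE_BP and LINKER_BP are the approximate values quoted in the text;
# the function name and its defaults are illustrative assumptions.
CORE_BP = 147    # DNA wrapped around the nucleosome core particle
LINKER_BP = 35   # approximate mammalian linker length


def ladder_sizes(n_bands, core_bp=CORE_BP, linker_bp=LINKER_BP):
    """Return the expected fragment lengths (bp) for the first n_bands bands."""
    repeat = core_bp + linker_bp  # one nucleosome repeat, about 180 bp
    return [k * repeat for k in range(1, n_bands + 1)]


if __name__ == "__main__":
    print(ladder_sizes(4))  # [182, 364, 546, 728]
```

Because linker length varies between species and cell types, passing a different `linker_bp` (for example a hypothetical 20 bp) shifts every band proportionally, which is why the ladder spacing is itself a readout of chromatin organization.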

Nucleosomes were initially thought to be simply a
means of compacting genomes (there is ~2.5 m of DNA
in a mammalian cell) and this is likely an important
function. Nonetheless, they are not static but dynamic
structures, histones being exchanged and differentially
modified during differentiation and development.

The promoters of protein-coding genes and developmental
enhancers initially appeared to be ‘nucleosome-free
regions’ based on their sensitivity to DNase
digestion and accessibility to transcription factors.236
However, more sensitive approaches have revealed that
nucleosomes do occur in the vicinity of promoters but are
“unstable” and subject to higher turnover.

There are also variant forms of nucleosomes, mostly
involving H2A. H2A can be replaced by H2A.Z, which,
unlike other histones, is multi-exonic and produced
throughout the cell cycle. H2A.Z is present in most
tissues, but is most highly expressed in embryos, is
essential for development in insects, vertebrates and
plants, and is associated with memory formation.

There exist two H2A.Z genes encoding almost identical
proteins (three amino acid differences) in chordates, one
of which expresses a primate-specific alternatively spliced
isoform in the brain. The two H2A.Z subtypes display
differential occupancy at the promoters of protein-coding
genes and enhancers, and regulate genes involved in early
embryological, neural crest and craniofacial development,
as well as the progression of some types of cancers.
Just one of the three amino acid differences
between the H2A.Z subtypes is sufficient to rescue the
developmental abnormalities caused by mutations in an
enzyme that catalyzes replacement of the canonical
H2A-H2B dimer with the H2A.Z-H2B dimer.

The H2A variant H2A.X is recruited to double-stranded
DNA breaks and its phosphorylated form is
required for their repair, a process that is also involved
in programmed genomic rearrangements during immune
cell development.

Another H2A variant (‘macroH2A’) contains an additional
and highly conserved large C-terminal domain
that has homologs in all kingdoms of life and is
covalently linked to its N-terminal histone homology
domain, which is also highly conserved but quite different
from that in the canonical H2A. The macro domain has
ADP-ribosylation activity and possibly RNA-binding
activity. MacroH2As are encoded by two multi-exonic
genes, one of which is alternatively spliced in the macro
domain. They associate with the inactive X chromosome
of female mammalian cells and inactive genes and appear
to have a role in maintaining heterochromatin.

There are also short variants, H2A.B, H2A.L, H2A.P
and H2A.Q, and splice isoforms thereof, which lack the
C-terminal tail of the core H2A. These variants appeared
in mammals and are tissue-specific, being expressed in
the testis and, in the case of H2A.B, also in the brain.
The H2A.B variant binds RNA, replaces H2A.Z in
nucleosomes at transcription start sites and intron-exon
boundaries in the testis and the brain, and interacts with
RNA polymerase II to promote the activation of
transcription. It is also involved in biparental
inheritance controlling embryonic development in mice.
H2A.B has a propensity for chromatin decompaction and
co-localizes with the RNAi proteins Miwi and Dicer in
spermatids, indicating a relationship between regulatory
RNAs, chromatin organization and splicing pathways.

The H2A.L.2 variant also has an RNA-binding domain
and appears to be guided to its sites of incorporation by
RNA. In sperm development it dimerizes with the H2B
testis-specific variant TH2B as a prelude to nucleosome
displacement by other highly basic proteins called
protamines, originally discovered by Miescher, which
mediate the extreme compaction of the chromosomes.

Histone H3 is replaced in nucleosomes by H3.3 (which
differs from H3 by only four amino acids) in telomeres
and pericentromeric regions and when chromatin assembly
occurs at times other than replication, including
in meiotic sex chromosome inactivation. Histone H2A-H2B
is bound to an essential telomerase RNA domain,
which suggests a role for histones in the folding and
function of the telomerase RNA component.

H3.3/H2A.Z double variant-containing nucleosomes
are enriched in active promoters and enhancers. Loss
of H3.3 results in infertility and/or defects in
gastrulation or neural crest development in flies, fish
and frogs. In mammals, H3.3 also accumulates in neurons,
reaching near saturation by adolescence, where it controls
neuronal- and glial-specific gene expression patterns,
with an essential role in plasticity and cognition. Rare
missense mutations in H3.3 have been shown to cause
neurologic dysfunction and congenital anomalies.
Mutations of lysines in the tail of H3.3 are commonly
observed in glioblastomas.

In flies and vertebrates, there are two seemingly
redundant genes encoding H3.3, H3A* and H3B, which
vary in their 3'UTRs and whose individual loss in
mammals causes infertility and reduced viability. Loss of
both causes embryonic lethality, due to heterochromatic
dysfunction at telomeres and centromeres, the latter of
which can be rescued by injecting dsRNA derived from
pericentromeric transcripts, indicating a functional link
with the silencing of such regions by an RNAi pathway.

* The coding sequence of H3A appears to have evolved under
strong purifying selection in the lobe finned fish and
tetrapods, without any change to the amino acid sequence.

In centromeres, H3 is replaced by another variant,
CENP-A, which is essential for kinetochore formation
required for chromosome segregation during mitosis
and meiosis. In plants, epigenetic memory is reset
by replacing H3 with the variant H3.10 (which is
refractory to lysine 27 methylation) during sperm
maturation to globally reprogram paternal gametes.

We could go on. The bottom line is that there is a high
degree of complexity in the composition of nucleosomes,
dynamic histone exchange and remodeling of chromatin
during development. However, little is known about
the decisional processes and mechanisms that determine
when and where different histones are incorporated into
particular nucleosomes, other than that RNA and ‘pioneer
transcription factors’ are involved (Chapter 16).


NUCLEOSOME REMODELING 


Pioneer or ‘architectural’ transcription factors, such as
Sox2 and Sox11,* which are ‘high mobility group’ (HMG)
proteins, bend DNA and initiate the opening
of chromatin by eviction of the linker histone H1.
Other pioneer factors, such as the winged helix/forkhead
box (Fox) proteins, bind to DNA within nucleosomes in
promoters and enhancers leading to their destabilization,
also by histone H1 displacement, and recruit Mediator
and cohesin to permit chromatin access for tissue-specific
remodeling factors such as FoxA, with different targets
in different cells at different stages of development.

* Sox2 and Sox11 are involved in the maintenance of
pluripotency and neuronal differentiation, respectively.

Histones are escorted to nucleosomes by companion
or ‘chaperone’ proteins, and histone exchange
requires a conserved family of ATP-dependent ‘chromatin
remodeling enzymes’ variously known as SWI/SNF,
ISWI, NuRD, CHD and INO80, many now known to
be regulated by cis- and trans-acting non-coding RNAs
(Chapter 16).


HISTONE MODIFICATIONS


Histone modification by methylation and acetylation was
first observed and proposed to have a regulatory function
in the 1960s, and nucleosomes were known to affect
transcription, but it was not until 1991 that Michael
Grunstein and colleagues provided definitive evidence of
gene regulation by histone acetylation. In 1996, David
Allis and colleagues isolated a histone acetyltransferase,
making use of the fact that histones in the macronucleus
of Tetrahymena cells are highly acetylated whereas those
in the micronuclei are not, which for the first time
directly linked a transcriptional regulator to a
histone-modifying enzyme. A reciprocal histone deacetylase
(HDAC) activity was reported a month later by Stuart
Schreiber and colleagues.

Shortly thereafter it was shown that the mammalian
‘transcriptional co-activators’ CBP, P300 and the
yeast ‘transcriptional adaptor protein’ Gcn5 function in
multi-subunit complexes to acetylate histones in
nucleosomes, linking what had been vaguely referred to
as ‘transcription factors’ to chromatin modification. These
findings changed the perception of the nucleosome from
being simply a mechanism for genome compaction to a
major player in regulating its expression.

The 1990s and 2000s saw the identification of a
bewildering array of histone modifications, mainly by mass
spectrometry, many of which still remain to be
characterized. At last count these occur at over 60
different positions, mainly in the N-terminal tails* of
the histones, which are intrinsically disordered
(Chapter 16) and exposed beyond the periphery of the
nucleosome. These modifications span mono-, di- and
tri-methylation, acetylation, ADP ribosylation,
ubiquitylation and/or sumoylation of various lysines in
histones H2A, H2A.X, H2B, H3 and H4; mono- and
di-methylation, acetylation and deimination of arginines
(to citrulline**) in H2A, H3 and H4; phosphorylation of
serines, threonines, tyrosines and one lysine in H2A,
H2A.X, H2B, H3 and H4; isomerization of prolines in H3;
and O-palmitoylation of a serine in H4.

* Some modifications also occur in the internal globular
domains of the histones.

** Citrulline is also an intermediate in the urea cycle.
Citrullination, catalyzed by peptidylarginine deiminases
(PADs), neutralizes arginine's positive charge and can
antagonize arginine methylation for local gene regulation
and global chromatin decompaction, with implications for
cell pluripotency and differentiation. The
peptidylarginine deiminase PAD4 is essential for the
remarkable formation of neutrophil extracellular
(chromatin) traps (NETs) that are phagocytosed by
macrophages to stimulate innate immune responses during
infection.317-319

Histone modifications also include propionylation,
butyrylation, malonylation, formylation, glutathionylation,
tyrosine hydroxylation and lysine
crotonylation,311,312,322,323 the latter at 28 different
lysines in H1, H2A, H2B, H3 and H4. Many of these
modifications are in low abundance, suggesting particular
contextual functions. An example of their impact, however,
is that citrullination of histone H1 leads to its
displacement from the nucleosome and the decondensation of
chromatin in pluripotent cells and during developmental
reprogramming.

The shorthand nomenclature for modifications is
histone > amino acid (single letter code) > position >
modification: for example, the methylation of arginine 11
on histone H4 is written as H4R11me, and the acetylation
of lysine 5 on histone H2B is written as H2BK5ac, etc.
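
The shorthand is regular enough to be split mechanically into its four fields. The sketch below is a minimal illustration rather than a standard parser. The `HistoneMark` type, the `parse_mark` function and the regular expression are assumptions made for this example, and it deliberately handles only single modifications on core histones written without a variant suffix (so compound marks such as H3K4me3Q5ser are rejected).

```python
import re
from typing import NamedTuple, Optional


class HistoneMark(NamedTuple):
    """One mark in the histone > residue > position > modification shorthand."""
    histone: str       # e.g. "H4", "H2B"
    residue: str       # single-letter amino acid code, e.g. "K", "R"
    position: int      # residue number within the histone
    modification: str  # e.g. "me", "me3", "ac", "ub"


# Hypothetical pattern for this sketch: histone name, one residue letter,
# the position digits, then a lowercase code with an optional degree digit.
_MARK = re.compile(r"(H\d[AB]?)([A-Z])(\d+)([a-z]+\d?)")


def parse_mark(text: str) -> Optional[HistoneMark]:
    """Split a simple mark into its fields; return None if it does not fit."""
    match = _MARK.fullmatch(text)
    if match is None:
        return None
    histone, residue, position, modification = match.groups()
    return HistoneMark(histone, residue, int(position), modification)


if __name__ == "__main__":
    for mark in ("H4R11me", "H2BK5ac", "H2AK119ub", "H3K4me3Q5ser"):
        print(mark, "->", parse_mark(mark))
```

Returning `None` for anything outside the simple pattern keeps the sketch from silently misreading combined or variant-specific marks, which would need a richer grammar.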

Other more exotic modifications have been discovered,
such as histone ufmylation (the conjugation of the UFM1
ubiquitin-like protein to H4 to promote DNA repair),
the covalent conjugation of the metabolite lactate at 28
sites on core histones (histone “lactylation”) and the
conjugation of the neurotransmitters serotonin and
dopamine to H3 glutamine 5 (H3Q5ser), also in combination
with trimethylated lysine 4 (H3K4me3Q5ser), in specific
regions of the brain. Cocaine administration, which causes
dopamine release from the ventral tegmental area (VTA),
induces hyperacetylation* of H3 and H4 at genes associated
with cocaine addiction in the nucleus accumbens, a
brain ‘reward’ region. Moreover, rats undergoing
withdrawal from cocaine dopaminylate histone H3 glutamine
5 (H3Q5dop) in the VTA, inhibition of which reverses
cocaine-mediated gene expression changes, attenuates
dopamine release in the nucleus accumbens, and reduces
cocaine-seeking behavior. These are potentially profound
observations for understanding brain function:
neurotransmitters have lasting epigenetic effects.

* Alcohol consumption also increases histone acetylation
in fetal and adult mouse brain.

There are over 100 enzymes known to catalyze histone
modifications at particular amino acid positions
in mammals (called code ‘writers’), and dozens more
that remove them (‘erasers’),* mostly acting on histone
H3, with a similar albeit less extensive repertoire in
other animals, plants and fungi. Many of these proteins
are encoded by homologs of genes first identified as
critical for Drosophila development, notably Polycomb,
Trithorax and Zeste (Chapter 5). The two multi-subunit
Polycomb complexes in mammals, PRC1 and PRC2, act
non-redundantly at target genes to maintain
transcriptional programs and cellular identity. PRC2
methylates lysine 27 on histone H3 (H3K27me), while PRC1
ubiquitinates histone H2A at lysine 119 (H2AK119ub), both
preferentially at unmethylated CpG islands, with a
complex interplay between them, including, for example,
with the core PRC2 component EED, which recruits histone
deacetylases. Trithorax proteins, which activate
gene expression, contain the SET domain, which
methylates H3K4 and is found in all eukaryotes.

* The discovery of histone modification erasers was
unexpected by many, as were the discoveries of DNA
demethylases (see below) and RNA modification erasers
(Chapter 17). Indeed, most levels of regulation beyond
transcription factors were initially met with
skepticism, and then largely shoehorned into the
transcription factor paradigm.

Substantial innovations in the subunit composition
of chromatin-modifying complexes have accompanied
increased developmental complexity. Histone modification
writer, reader and eraser complexes are more elaborate
and diverse in mammals than in invertebrates. The
Drosophila PRC1 complex, for example, has just one
version of each of its constituent subunits, whereas
mammalian PRC1 can incorporate any one of two RING
subunits, three PHC subunits, six PCGF subunits and five
CBX subunits,338-340 the latter of which interact with the
neural gene repression factor REST and appear to be
involved in the formation of local phase-separated
domains (Chapter 16).
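
To give a sense of scale for this combinatorial expansion, the subunit counts quoted above can be multiplied together. The sketch below is a back-of-the-envelope upper bound only: it assumes, purely for illustration, that one paralog is chosen independently from each of the four subunit families and that every combination can assemble, which the text does not claim. The dictionary and function names are hypothetical.

```python
from math import prod

# Paralog counts per PRC1 subunit family, as quoted in the text.
MAMMALIAN_PRC1_PARALOGS = {"RING": 2, "PHC": 3, "PCGF": 6, "CBX": 5}

# Drosophila PRC1 is described as having one version of each subunit.
DROSOPHILA_PRC1_PARALOGS = {name: 1 for name in MAMMALIAN_PRC1_PARALOGS}


def max_assemblies(paralogs):
    """Upper bound on distinct complexes if one paralog per family is chosen
    independently (an illustrative assumption, not a measured number)."""
    return prod(paralogs.values())


if __name__ == "__main__":
    print(max_assemblies(DROSOPHILA_PRC1_PARALOGS))  # 1
    print(max_assemblies(MAMMALIAN_PRC1_PARALOGS))   # 180
```

The real number of functional assemblies is presumably smaller, since not every paralog combination need be compatible or co-expressed, but the calculation shows how a handful of gene duplications can multiply the regulatory repertoire.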

Similar increases in subunit complexity and/or the numbers
of orthologs or genomic binding sites also occurred
in the Mediator complex in metazoans, CTCF in
bilaterians, the HUSH complex for heterochromatin
regulation in vertebrates, and the fast-evolving
zinc-finger transcription factors (one of the largest gene
families in humans), which underwent major expansions and
many of which have associated metazoan-specific BTB,
tetrapod-specific KRAB or mammal-specific SCAN domains.

Different types of histone modifications are recog- 
nized by over 70 known ‘reader’ proteins, many of which 
contain Tudor, PHD finger, MBT, bromo or chromo 
domains that occur in a range of chromatin remodel- 
ing and histone-modifying factors.???2455? PHD fingers 
read the tail of histone H3, primarily the methylation 
state of H3K4 (K4me3/2), and to a lesser extent the 
methylation state of H3R2 (R2me2)” and the acetylation 
state of H3K14.*°? Bromo domains? primarily recognize 
acetylated lysine residues,% and occur along with acet- 
yltransferase domains in the pioneer factors CBP and 
P300.35 Chromo, Tudor and MBT domains are part of 
an extended family that evolved from a common ancestor 
and recognize methylated lysines.358.35 

Underscoring their importance, mutations in his- 
tone modification writers, readers and erasers cause 
developmental abnormalities, intellectual disabilities and 
cancers.360364 For example, 10% of leukemias are caused 
by translocations and ectopic fusions of the Trithorax 
homolog KMT2A (lysine-specific methyltransferase 2A), 


P The 7SK RNA/P-TEFb complex has also been reported to be a 
‘reader’ of the H4R3me2 modification???» 

4 Bromodomain proteins have been explored as a target for anticancer 
drugs, with mixed results.*°> 
previously called MLL1, for ‘mixed lineage leuke-
mia’ 1.36 Dysregulation of the chromatin-binding PHD
finger protein JARID1, which binds H3K4me2/3, also
causes leukemias.?96 Haploinsufficiency of histone deacet- 
ylase 4 (HDAC4) results in brachydactyly mental retarda- 
tion syndrome.*% A number of drugs that inhibit histone 
deacetylases have been licensed for use against hemato- 
poietic cancers, particularly lymphomas and myelomas.*8 


THE HISTONE CODE 


In 2000, David Allis and Brian Strahl proposed the ‘his-
tone code hypothesis’:


First, the establishment of ... a combinatorial pat- 
tern of histone modification, i.e., the histone code, 
in a given cellular or developmental context ... 
Second, the specific interpretation or the ‘reading’ 
of the histone code ... (which) function broadly 
to set up an epigenetic landscape that determines 
cell fate decision-making during embryogenesis 
and development.370 


The last sentence is the key and far-reaching conclusion, 
which takes gene regulation in eukaryotes well beyond 
conventional transcription factors and suggests that epi- 
genetic processes comprise the senior level of control 
of developmental trajectories, notwithstanding the fact 
that the differentiation state of cells can be changed by 
ectopic expression of transcription factors (Chapter 15). 

It has taken a long time for this view of the regulation 
of cell fate during development to overcome the hege- 
mony of transcription factors, and there has been staunch 
opposition to it. As Allis later recalled: 


Chromatin studies in this era paled in compari- 
son with the more exciting studies on transact- 
ing transcription factors that were all the rage ... 
Moreover, well defined paradigms of gene regu- 
lation had been elegantly worked out in prokary- 
otic models ... Histone proteins were viewed as 
only being in the way of where all of this exciting 
action took place. My career choice to study his- 
tone biology was a steep uphill climb, especially 
given the popular notion that histones did not 
really matter in gene regulation.*”! 


Even after histone modifications were shown to have a 
role in the regulation of the expression of iconic genes 
involved in development, their action was widely inter- 
preted in terms of nucleosome control of transcription 
factor accessibility, rather than considering what might 
regulate nucleosome position and histone modification 
state in the first place. 
The emphasis on transcription initiation as the main
focus of 'gene regulation' and the resistance to the sug- 
gestion that epigenetic regulation may determine which 
genes are available to be transcribed are perhaps best 
illustrated by a 2013 article by Mark Ptashne, who pio- 
neered the characterization of transcription factor bind- 
ing to DNA in bacteria and yeast.!98372-3% Ptashne's 
article, entitled ‘Epigenetics: Core Misconcept’,*” stated: 


Development of an organism from a fertilized egg 
is driven primarily by the actions of regulatory 
proteins called transcription factors ... Rather, it 
is said, chemical modifications to DNA ... and to 
histones ... drive gene regulation. This obviously 
cannot be true because the enzymes that impose 
such modifications lack the essential specific- 
ity ... and so these enzymes would have no way, 
on their own, of specifying which genes to regu- 
late under any given set of conditions??? 


The latter point is correct, but Ptashne and many others 
overlooked the possibility that the specificity might be 
supplied by trans-acting RNAs, despite the fact that he 
had elsewhere recognized that RNA molecules can act as
transcriptional co-activators.*7% Of course, regulation of
chromatin organization and transcription initiation are
neither mutually exclusive nor separable; the factors involved act
in concert to govern the complex patterns of gene expres- 
sion during development (Chapter 15). 

Deciphering the histone code is a huge challenge, not
least because of the difficulty of analyzing the modifi-
cations and their effects on gene expression at the nucleo-
some level, the context dependency of the large number of
possible combinations of chromatin marks, and the
heterogeneity of the samples. Nonetheless, the growing
popularity of the field not only led to the rapid discovery 
of the many enzymes and complexes involved?”?378 but 
also the roles of modifications by a number of pioneering 
labs," using in vitro approaches (e.g., with reconstituted 
nucleosomes) and modification-specific antibodies for 
global analysis of the in vivo distribution of nucleosomes 
containing the modification.?7320,379-381 

The latter revealed non-random patterns of modifi- 
cations in different tissues and developmental stages, 
such as in the Neurod2 gene in the brain (Figure 14.5), 
hypoacetylation of the inactive X chromosome in female 
mammals and silent mating type genes in yeast, and 
hyperacetylation of the upregulated X chromosome 


* Including those of Allis, Shelley Berger, Rudi Jaenisch, Thomas 
Jenuwein, Manolis Kellis, Tony Kouzarides, Bob Kingston, Danny 
Reinberg, Bing Ren, Bryan Turner, Rick Young, Jerry Workman, Shi 
Yang and many others. 
FIGURE 14.5 Dynamic landscape of histone modifications at the mouse Neurod2 (Neuronal Differentiation 2) locus in different
tissues and during development. (Reproduced from ENCODE Project Consortium*% under Creative Commons CC BY license.)


in Drosophila males or transcribed globin genes in
erythrocytes.?*!

High-resolution mapping by sequencing of immunoprecipitated
chromatin (‘ChIP-seq’) has shown that histone modifications
are differently imposed in complex patterns at millions of
different genomic positions in different tissues or cell
types at different stages of differentiation and
development.*%2-386 There is clearly also ‘crosstalk’
between histone modifications,3%73% which
may occur in modules, but little is yet understood of the
lexicon or syntax.**4

Active genes are characterized by acetylation of various 
lysines or arginines, which neutralizes their charge interac- 
tions and makes chromatin more accessible.*83% Different 
acetylations are found in different regions of genes and reg-
ulatory regions: H2AK9ac, H2BK5ac, H3K9ac, H3K18ac,
H3K27ac, H3K36ac and H4K91ac are mainly located in
the region surrounding the transcription start site, whereas
H2BK12ac, H2BK20ac, H2BK120ac, H3K4ac, H4K5ac,
H4K8ac, H4K12ac and H4K16ac are elevated in the pro-
moter and transcribed regions of active genes.384

Nonetheless, even the roles of well-studied modifi- 
cations, including acetylation, of different histones and 
residues by distinct complexes in different cell types and 
species are far from fully characterized. For example, in 
human cells the histone acetyltransferase KAT8* modi- 
fies different H4 residues (H4K5 and H4K8 vs H4K16) 
depending on its associated proteins, with different 
regulatory and pleiotropic effects.?^ H3K27ac marks 
are generally thought to distinguish active enhancers 
from inactive/poised enhancers that contain H3K4me1
alone,!67395 although H3K27ac alone is insufficient to per-
mit enhancer activity??? (see below).

Conversely, trimethylation of the same lysine
(H3K27me3) marks facultative heterochromatin (regions 
that are differentially expressed in development and/or 
differentiation), such as the inactive X chromosome.?” 
This modification is imposed by PRC2 through one of two 
alternative catalytic subunits, EZH1 or EZH2,' which are 
expressed at different stages of development.*% Mutations 
in EZH2 cause Weaver Syndrome, which is character-
ized by skeletal and cognitive abnormalities.?? Mutually
exclusive acetylation and methylation also occur at other 
lysines including H2BK5, H3K4, H3K9 and H3K36, all 
of which are acetylated at active promoters.?*^ 


* KAT8, also known as MOF or MYST1, is classically associated with
H4K16 acetylation in transcriptional activation, notably in the MSL 
complex that executes the roX RNA-directed global upregulation of 
the expression of the X-chromosome in Drosophila males for dosage 
compensation. Disruption of the orthologous human MSL complex 
also impairs H4K16 acetylation and results in an X-linked syn- 
drome marked by developmental delay, gait disturbance and facial 
dysmorphism,?! as well as tumor maintenance by exacerbating 
chromosomal instability,*23% the latter exemplifying that histone 
modifications have other roles in chromosome biology beyond the 
regulation of gene expression. 

EZH1 and EZH2 contain the lysine-specific SET (Su(var)3-9,
Enhancer of Zeste, Trithorax) domain that uses the cofactor 
S-adenosyl-L-methionine (SAM) as the methyl donor. 
H3K27me3 has been widely implicated in restraining
the expression of lineage-specifying and cell-state defin- 
ing loci from plants to animals,?355400-405 and mutation of 
this residue recapitulates PRC2 transformations.‘ Its role 
in regulating the timing of the differentiation of progeni- 
tor cells has also been linked with epigenetic switches 
controlled by opposing PRC2 and Kdm6a/b demethylase 
activities, for example in regulating T cell commitment 
timing in mammals.» H3K27me3 repression of gene 
expression also appears to be confined within TADs.4^02406 

There have been various attempts to use signature his- 
tone marks to identify enhancers, with initial correlations 
with the binding of the transcriptional co-activator P300 
suggesting that enhancers are characterized by the pres-
ence of monomethylated histone H3 lysine 4 (H3K4me1)
and the absence of trimethylated H3K4 (H3K4me3)."?
However, subsequent studies showed that H3K4me3 is
enriched, whereas H3K4me1 is reduced, in highly active
enhancers,/9?^?7 and that characterized enhancer regions
contain a variety of histone modifications in different
combinations, not necessarily the presumed canonical
H3K4me1 or H3K27ac marks.!/!584405409 Bioinformatic
predictions of enhancers based on histone modification 
patterns alone have low validation rates.!55407410 

The H3K4me3 modification is not only associated with 
active enhancers but also with actively transcribed pro-
tein-coding genes,*!! or genes “poised” for transcriptional
activation.^^-^" H3K4me3-modified histones exhibit a 
peak around transcriptional start sites??? and interact with 
RNA polymerase subunit TFIID.5%418-+0 Transcription 
start sites also exhibit a typical flanking bimodal pattern 
of H3K4me2- and H3K4me3-marked nucleosomes.*2041 

So-called “bivalent domains” containing both H3K4 
and H3K27 methylation occur around conserved non- 
coding sequences associated with developmentally 
important transcription factors, suggesting that chroma- 
tin state is important for maintaining embryonic pluripo- 
tency.223 Recent data shows that pluripotent states are 
determined by interactions between chromatin modifica- 
tions and enhancer expression to reconfigure the target 
specificity of the pioneer transcription factors Oct4, Sox2 
and Nanog.?^ Other modifications, such as H4K16ac,
occur in active enhancers and protein-coding genes,"! 
further obscuring the distinction between them. 

H3K14pr and H3K14bu are (also) preferentially
enriched at promoters of active genes?? and H2AK119ub1
guides maternal inheritance and zygotic deposition of 


u It is also clear that transcriptional ‘pausing’ and modulation of elon- 
gation rates plays an important role in the dynamic control of gene 
expression, including splicing./?215412-414 
H3K27me3 in mouse embryos.*542% Histone H4 lysine 16
acetylation (H4K16ac), a hallmark of decondensed, tran-
scriptionally permissive chromatin, directly stimulates the
Dot1 histone H3K79 methyltransferase.**° H3.3 variants
are phosphorylated at S31 in gene bodies for high-level 
activation of rapidly induced genes, shown in macrophages 
to be coordinated with SETD2 methylation of H3K36 to 
effect recruitment and ejection of chromatin regulators. 

Constitutive heterochromatin in genomic regions
such as the centromeres and telomeres contains high
levels of H4K20me3 and H3K9me3,??*^" the latter of
which binds the repressive HP1 protein via its chromo-
domain.?48428 H3K4me3 and H3K9me3 mark imprinting
control regions.^" Histone sumoylation appears to act as
a repressive mark by recruiting HDACs to gene promot-
ers,* and H3K36me is present in nucleosomes along
the body of transcribed genes, and is necessary for effi- 
cient constitutive pre-mRNA splicing by recruiting the 
chromo domain protein Eaf3 to mediate interaction with 
the splicing machinery.?9.! 

These are just some examples. The patterns are com- 
plex and studies are becoming more sophisticated.?9? 
Targeted deposition or removal of histone modifications 
using CRISPR/Cas9-fusions and related approaches, such 
as single-cell CRISPR screens and chromatin modification 
profiling by mass spectrometry, are starting to allow the 
dissection of causal roles for individual modifications.^?-49? 

An important discovery was that nucleosomes are 
preferentially positioned over exons,^^0-^9 suggesting that 
histone modifications convey exon-specific information, 
and that epigenetic control of gene expression extends to 
the level of individual exons. This offers a mechanistic 
explanation for the observed coupling of chromatin struc- 
ture, transcription and splicing,* including the physical 
co-location of alternatively spliced exons with promot- 
ers,*% and a basis for exon selection by histone modifica- 
tions at different stages of development in different cell 
types and different conditions, ^44 which appears to 
be controlled in part by small RNAs.45+-4%6 

Chromatin-modifying proteins have a profound impact 
on developmental processes because they lie at the func- 
tional center of epigenetic regulatory networks. They do 
not make (although they do convey) locus-specific regula- 
tory decisions but rather are directed by other information 
that does. How histone modification writers and erasers 
select particular nucleosomes at particular genomic posi- 
tions for particular modifications in different cell types 
is unknown, but is likely RNA guided (Chapter 16). The 
histone modifications and nucleosome positioning must 
be tightly controlled during development, as developmen- 
tal trajectories are precise (Chapter 15), although histone 
modifications are also influenced by metabolic and physi-
ological factors.157-40 Moreover, histone-modifying
proteins are themselves subject to post-translational modi-
fications,^9^? which suggests yet more layers of develop-
mental control and environmental tuning.

Histone modifications are often inherited through mei- 
osis and mitosis, to transmit information between genera- 
tions45.2646346^ (Chapter 17) and to ‘bookmark’ loci for 
reactivation or maintenance of heterochromatin after cell 
division.+65-467 The available evidence is that the parental 
core H3-H4 tetramer is split and segregated strand-spe- 
cifically at the replication fork, that parental histones are 
recycled to sister chromatids and re-incorporated near 
their original positions, maintaining their acetylation and 
methylation marks, possibly asymmetrically to alter cell 
fate in daughter cells.46846 

Replication timing maintains the global epigenetic state 
in human cells.^? It is still unclear, however, how the his- 
tone modifications are inherited, particularly in view of the 
report that epigenetic memory is independent of symmetric 
histone inheritance replication"! although histone-modi- 
fying enzymes remain associated with DNA during repli- 
cation.^?4? Epigenetic marks are also erased and reset with 
every round of transcription (which involves similar disas- 
sembly of or navigation through nucleosomes),169473-476 
again possibly involving RNA direction.75+553477 

The imposition and maintenance of this informa- 
tion clearly involves histone modification writers but 
this does not explain their locus specificity, which prob- 
ably operates to exon level. Whatever the mechanism(s), 
the amount of information involved, and stored in the 
genome, must be enormous. 


DNA METHYLATION 


All four bases of DNA are subject to modifications, 
more than 20 of which have been identified,*7**” the 
most common being methylation of cytosines and aden-
osines. In bacteria, DNA methylation (mainly m6A) is
used to protect endogenous sequences against restriction 
endonuclease cleavage (Chapter 6), but also has roles in 
DNA replication and gene expression.*8!-48+ Adenosine 
methylation has been reported in protists, plants and ani- 
mals,^55-^? although emerging evidence suggests that the 
source may be microbial or RNA contamination.*93494 
5-Methylcytosine (m5C) occurs widely in eukaryotic
genomes and is the best studied. In fungi and plants, cyto- 
sine methylation is used to silence viruses and transposons, 


Y Bacterial DNA can also be modified by phosphorothioation, as part 
of an alternative restriction-modification defense system.^*0 
as well as to regulate development ^95-^?7 and environmen-
tal responses, 95^? likely related processes. In maize, the
cycling of transposable elements between active and inac- 
tive states to regulate local gene expression is determined 
by the methylation state of the element,4%55%5! which 
may also be in part the role of TEs in animals. 

Most invertebrate genomes are not heavily methyl- 
ated and some species such as Drosophila and C. elegans 
appear to lack DNA methylation, indicating that it does 
not play a role in their development, although it is used 
for genome defense and gene regulation in other inverte- 
brates.5%25% For as yet unknown reasons, a major evolu- 
tionary transition from fractional to global methylation 
occurred at the origin of the vertebrates,50% as did the 
appearance of regional variation in GC content.!?? 

DNA methylation as a major player in gene regulation 
in mammals first came to light in the 1970s with the obser- 
vation that there is differential methylation of the mam- 
malian X chromosomes.?05506 Later studies showed that 
there is widespread erasure of methylation in the mam- 
malian germ line” and in early development;?/??!6 selec- 
tive reimposition of methylation at different loci including 
enhancers in different cell lineages,75!65!* and reacti- 
vation of genes (and induction of tumors) by a cytosine 
analog (5-aza-cytidine) that cannot be methylated.???! 
Embryonic stem cells maintain their pluripotent state in the 
absence of DNA methylation, but cannot differentiate." 

In mammals, methylation is primarily, but not solely, 
associated with repression of gene expression, nota- 
bly in inactivated X chromosomes,* pericentromeric 
heterochromatin, imprinted loci and the regulation of 
transposons (which are related, as most of the targets of 
methylation are TE-derived) and occurs mostly and sym- 
metrically in cytosines in CpG dinucleotides, except those 
clustered in so-called ‘CpG islands'.?—?* Deamination 
of methylcytosine yields thymidine, which is thought to 
account for the underrepresentation of CpG dinucleotides 
in mammalian genomes.>252 
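The underrepresentation of CpG can be quantified as the ratio of observed to expected CpG dinucleotides, a commonly used measure. The sketch below is illustrative only: the function name `cpg_obs_exp` and the toy sequences are hypothetical, and the expected count simply assumes that C and G occur independently of each other along the sequence.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for a DNA sequence.

    The expected count assumes C and G are placed independently:
    expected = (#C * #G) / length.
    Returns 0.0 when the expected count would be zero.
    """
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number.
    observed = seq.count("CG")
    expected = c * g / n
    return observed / expected


# Toy sequences (hypothetical, for illustration only).
island_like = "CGCGCGGCGCGC"  # CpG-rich, as in a CpG island
depleted = "CATGCATGCATG"     # contains C and G but no CpG
print(cpg_obs_exp(island_like))  # above 1: CpG is over-represented
print(cpg_obs_exp(depleted))     # 0.0
```

A ratio well below 1 across a genome is what the deamination argument predicts; unmethylated CpG islands are the regions that escape this depletion.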

The sequence symmetry of CpG enables propagation 
of the methylation mark through cell division, which 
combined with its complex interplay with Polycomb 
repressive and other histone-modifying complexes?95?! 
and its differential patterns during development, led to 
the proposal that CpG methylation comprises a pathway 
for cellular memory of transcriptional states.505506 
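The symmetry argument can be made concrete. Because CG is its own reverse complement, every CpG on one strand faces a CpG on the other, so a methylated parental strand can template the mark on the newly synthesized strand. The sketch below is a toy model with hypothetical function names. Positions are 0-based indices of the C on the top strand, and it assumes a maintenance step that methylates exactly the hemi-methylated CpG sites.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def cpg_sites(seq):
    """0-based positions of the C in each CpG on the given strand."""
    return {i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"}


def maintain(seq, parental_marks):
    """Toy maintenance step after replication.

    The daughter strand starts unmethylated. Each CpG whose parental
    strand carries the mark (a hemi-methylated site) is methylated on
    the daughter strand as well. Marks outside a CpG context are not
    copied because they have no symmetric partner.
    Returns the CpG sites (indexed by the top-strand C) that are
    methylated on the daughter strand.
    """
    return parental_marks & cpg_sites(seq)


assert reverse_complement("CG") == "CG"  # the dinucleotide is palindromic

top = "ACGTTCGACAG"
marks = {1, 8}  # position 1 is a CpG; position 8 (the C in CA) is not
print(maintain(top, marks))  # {1}
```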


w In zebrafish, the methylome is erased in oocytes but not in sperm,>°” 
and the methylome pattern is reconstituted in the zygote (apparently) 
to match the paternal pattern.*% Thereafter, methylation appears to be 
constitutive throughout development,5%-!! as it is also in Xenopus"? 

x The spreading of X-inactivation from the ‘X-inactivation center’ on
the X chromosome in females appears to be mediated by methylation
of LINE elements distributed along the chromosome.??


CpG islands occur mainly in promoters (includ-
ing those of enhancers??), especially those of broadly
expressed housekeeping genes,7993259 methylation of 
which correlates negatively with gene expression, although 
repression of these promoters appears to occur primarily 
by H3K27me3 histone modifications.496.531,534 Genes with 
CpG island promoters also have other characteristic epi-
genetic signatures, including high levels of H4K20me1,
H2BK5me1 and H3K79me1/2/3 at their 5’ end.*5

In contrast, tissue-specific protein-coding genes usu- 
ally, but not always, lack islands.??9532.555 In transcription- 
ally active genes, CpG islands are devoid of methylation 
and enriched for permissive nucleosome modifications 
such as H3K4 methylation. On the other hand, DNA 
methylation is enriched in the body of highly transcribed 
genes, often associated with H3K36 methylation, 0.55558 
where it influences nucleosome positioning?? and alter- 
native splicing,** phenomena that may be linked. It has 
been recently shown that hypomethylated CpG dinucleo- 
tides preserve an archive of tissue-specific developmental 
enhancers in adult mouse cells, marking decommissioned 
sites and enabling recovery of epigenetic memory, a 
process involving the pioneering factor FoxA and TET2/3
methylcytosine dioxygenases** (see below). 

Cytosine methylation is carried out by DNA meth- 
yltransferases, of which vertebrates have three: two 
‘establishment’ DNA methylases (Dnmt3a and 3b), and 
one ‘maintenance’ methylase (Dnmt1) that recognizes
hemi-methylated CpG sites following DNA replica-
tion. All three are required for embryonic development,
with mutations causing syndromic developmental neu-
rological, sensory and immunological defects, and loss
of Dnmt1 in neurons at later stages resulting in cogni-
tive defects.391.5445 The histone mark H3K36me2 also
recruits Dnmt3a to regulate intergenic DNA methyla- 
tion?" and H3K23 ubiquitylation couples maintenance 
DNA methylation with replication.5** While DNA meth- 
ylation is thought to be stable, it is cycled at promoters at 
high frequency, suggesting an updating mechanism.54545 

The methyl-CpG-binding protein MeCP2 links DNA 
methylation to histone methylation,*° and is essential for 
brain development and function.5^5» Loss of MeCP2, 
which is encoded on the X chromosome, causes a neu- 
rological disorder called Rett Syndrome with variable 
penetrance in females (due to variable patterns of X inac- 
tivation) whereas its loss in males usually leads to severe 
congenital encephalopathies and early death.5% Dnmt3a 
and MeCP2 originated at the onset of vertebrates, with 
methylation of non-CpG sites being exceptionally high 
in the mammalian brain and regulating highly conserved 
developmental genes, with a likely role in the evolution 
of cognition.^! 
Dnmt2 was originally thought to be a DNA methyl-
transferase but is, in fact, a tRNA methyltransferase, and
it seems likely that the modern DNA methyltransferases 
evolved from an ancient RNA methyltransferase.5%2-554 

Methylcytosine is converted to hydroxymethylcytosine 
(hmC) by TET proteins,*” which can also further oxidize 
hmC to generate 5-formylcytosine and 5-carboxylcyto- 
sine.?* There are three TET proteins in mammals with 
different expression patterns and different targets during 
development.5?7555 TET proteins hydroxymethylate DNA
at enhancers and telomeres,’ and TET1 and TET2
associate with Nanog to facilitate reprogramming of
somatic cells to pluripotency.5%-5% Formation of 5-hmC
is required in embryonal stem cells for the maintenance 
of pluripotency and inner cell mass specification.>>’ It is 
also required in the brain, especially in Purkinje cells, 
where it is almost 40% as abundant as meC.?9 TET3 is 
present in neurons and oligodendrocytes but absent in 
FIGURE 14.6 Aberrant methylation patterns in enhancer loci in cartilage chondrocytes from patients with hip osteoarthritis
(OA) and knee OA, compared to healthy controls (NC). (Reproduced with permission from Lin et al.5 under Creative Commons
CC BY license.)


astrocytes.5* TET3 regulates behavioral adaptation in
the neocortex, 9 as well as synaptic transmission and 
plasticity in the hippocampus,*% and its loss results in 
increased anxiety-like behavior and impaired spatial ori- 
entation.5% Fear extinction, an important form of reversal 
learning, leads to a dramatic genome-wide redistribution 
of 5-hmC within the infralimbic prefrontal cortex, and 
learning-induced accumulation of 5-hmC is associated 
with the establishment of epigenetic states that promote 
gene expression and rapid behavioral adaptation.565 

DNA methylation patterns have been extensively 
analyzed following the discovery by Marianne Frommer 
and colleagues that bisulfite treatment of DNA converts 
cytosine, but not meC, to uracil, which sequences as 
T,567555 and more recently by direct DNA sequencing
using nanopore technology, which can distinguish mod- 
ified from unmodified bases.56?57? For this reason (tech- 
nical ease of analysis) and its earlier discovery, DNA 
methylation has been more widely studied than his-
tone modifications, notably in the ‘Human Epigenome
Project’, which revealed differences in methylation
patterns in different cell types and interplay between 
genetic variations and epigenetic state during develop- 
ment and aging,’ in the brain, in cancer and other dis- 
eases such as arthritis2265457—575 (Figure 14.6). Akin 
to the insights now routinely offered by RNA-seq, new 
techniques will increasingly reveal the variety and 
dynamics of epigenetic states and transcription factor 
occupancy at single-cell resolution during development 
and in diseases such as cancer.?76577 
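The logic of bisulfite sequencing described above can be sketched in a few lines. This is a toy model with hypothetical function names, assuming complete conversion, a single strand and error-free reads. Real analysis pipelines must also handle both strands, incomplete conversion and read alignment.

```python
def bisulfite_read(reference, methylated):
    """Simulate the read obtained after bisulfite treatment.

    Unmethylated C is converted to uracil and sequences as T;
    methylated C (0-based positions in `methylated`) is protected
    and still reads as C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(reference)
    )


def call_methylation(reference, read):
    """Infer methylated cytosines by comparing a read with the reference.

    A reference C that still reads as C is called methylated;
    a reference C that reads as T is called unmethylated.
    """
    return {
        i
        for i, (ref, obs) in enumerate(zip(reference, read))
        if ref == "C" and obs == "C"
    }


reference = "ACGTCCGA"
read = bisulfite_read(reference, methylated={1})
print(read)                               # ACGTTTGA
print(call_methylation(reference, read))  # {1}
```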

However, as with histone modifications, there is little 
known about the signaling pathways that direct the locus- 
specific imposition or removal of cytosine methylation by 
generic enzymes during development, learning and disease, 
except that the RNAi pathway is involved (Chapter 16). 


THE REGULATION OF DEVELOPMENT 


The roles of the (admittedly at the time vague) organiza- 
tion of chromatin in the regulation of gene expression, 
and the mechanisms that might be involved, were rarely 
considered when the bacterial model was extrapolated to 


Y The link between genetic variations and epigenetic state of regulatory 
elements affecting gene and trait expression is illustrated by the clas- 
sic example of lactase non-persistence in mammals and the selection 
for lactase persistence during aging observed in many Europeans 
and other pastoral cultures, which involves non-coding variations, 
a specific IncRNA (LOC100507600 or Lactase antisense RNA 1), 
RNA interference and DNA methylation in intronic enhancers.%557! 
developmentally complex eukaryotes. Accordingly, since
then, the regulation of gene activity by chromatin archi- 
tecture has been viewed predominantly through the lens 
of DNA-binding transcription factors. 

This interpretative lens led to the loose and confusing 
description of many proteins that are required to mediate 
the patterns of gene expression during development, such 
as those that organize chromosomal domains or modify 
chromatin, as ‘pioneer transcription factors’ or ‘tran- 
scriptional co-activators’,3°4+°78>8! despite the fact that 
they have no intrinsic or only vague DNA-binding speci- 
ficity (Chapter 16). The implied assumption biased the 
interpretations of experimental observations and retarded 
the understanding of the control of gene expression dur- 
ing development by placing proteins that are required to 
instruct genome architecture into the same conceptual 
and mechanistic basket as those that bind to specific 
sequences to activate or inhibit transcriptional initiation. 

Appreciating that chromatin modifications play a 
central role in the regulation of gene expression during 
development has also been confused by the term ‘epi-
genetic inheritance’, implying that it is separate from
‘genetic’ (DNA-based) inheritance, obscuring the fact 
that the unfolding cascade of epigenetic modifications 
must be instructed by information that is encoded in the 
genome. 

The key challenges are to consider how much infor- 
mation is required to orchestrate organismal ontogeny 
(Chapter 15) and to identify the pathways that connect 
chromatin modifications, enhancers, effector proteins 
and other layers of genome regulation during ontog- 
eny (Chapter 16). How this information is modulated 
by the environment and during learning is addressed in 
Chapter 17. 
AUTOPOIESIS


The word 'autopoiesis' was coined in 1972 by Humberto 
Maturana and Francisco Varela from the Greek αὐτο-
(auto-, meaning “self”) and ποίησις (poiesis, meaning
“creation”) to refer to the property, ability and process of 
self-replication,! the defining feature of life, the unit of 
which is the cell. 

Cellular life on Earth is estimated to date back at 
least 3.7 billion years,? not all that long after the crust 
cooled sufficiently for water to condense — suggesting 
that either self-replicating entities can evolve quickly in 
an aqueous environment or, as proposed by the British 
astronomers Fred Hoyle and Chandra Wickramasinghe, 
that life evolved elsewhere and arrived on dust parti- 
cles from space, taking hold when the conditions were 
right.* 

As described in Chapter 4, developmentally complex 
eukaryotes appeared and evolved over the past billion 
years, first documented in metazoans as trace fossils of 
sponges around 900 million years ago, then segmented 
wormlike creatures in the Ediacaran Period around 600 
million years ago. Phenotypic diversity exploded 520 
million years ago in the Cambrian when recognizable 
representatives of all modern phyla appear almost “over- 
night”, with rapid secondary radiations also occurring 
after subsequent major extinction events. Design varia- 
tions were explored using a relatively stable proteome, 
albeit not without some innovations and expansions of 
particular gene families. While it is widely thought that 
the greater energy capacity enabled by the photosynthetic 
production of oxygen and electron transport in mitochon- 
dria was an enabling step, or at least a precondition, it 
is likely that new mechanisms had to evolve to manage
cellular growth, organization and differentiation into an 
integrated three-dimensional assemblage where each cell 
has a specialized function. 


THE OVERARCHING QUESTION 


Evidently there must be differences in the decisional sys- 
tems that control spatially organized cell division and 
specific commands to exit division and enter differen- 
tiation in multicellular organisms, in contrast to simple 
microbial growth and reproduction allowed by nutrient 


DOI: 10.1201/9781003109242-15 


supply.? These differences must also be reflected in the 
structure of the information held in the genomes of plants 
and animals, which show some universal features: the use 
of ‘enhancers’ to direct gene expression patterns during 
development; the organization of the genome into chro- 
matin territories, topological domains and nucleosomes; 
the post-translational modification of histones and many 
other proteins (Chapters 14 and 16); large numbers of 
“repetitive” sequences derived from transposable elements 
that drive phenotypic innovations in, e.g., vertebrate,>° 
tetrapod,’ mammalian?? and primate? evolution;'?!! the 
partitioning of protein-coding and non-protein-coding 
genes into exons and introns (Chapter 7); and the increase 
in the numbers of small and large regulatory RNAs with 
developmental complexity (Chapters 12 and 13). 

Why are there so many histone variants and modifica- 
tions and how do chromatin marks relate to the expres- 
sion and function of ‘genes’ and ‘enhancers’? How is site 
selection achieved in the deployment of these variants 
and modifications at millions of locations around the 
genome during development? What is the role of different 
types of repetitive elements? Why are there large numbers
of small RNAs and lncRNAs expressed in highly
cell-specific patterns? Are these phenomena intertwined? 

These questions are subsets of the overarching ques- 
tion of the nature and scaling of the information and deci- 
sional hierarchies that direct multicellular development, 
especially since it is the non-coding, not the coding, por- 
tion of the genome that expands with increasing develop- 
mental and cognitive complexity. How much information, 
then, is required to program human development, how is 
it encoded and how is it manifested? 

Consideration of these questions, and even awareness 
of their importance to the understanding of molecular 
biology and genome biology, has been given scant atten- 
tion in the face of the emphasis on 'genes' and proteins, 
the widespread assumption that RNA functions purely as
a link between the two, and the focus of most biomedi- 
cal research on biochemistry, physiology and associated 
diseases. However, the specification of metabolic and 


1 Genome engineering has created a bacterial cell with only 473 
protein-coding genes, the smallest genome of any free-living organ- 
ism. Many of these essential genes have unknown functions.* 

^ 63% of the primate-specific hypersensitive regions (open chromatin
associated with promoters) are occupied by TE remnants.'° 
physiological pathways surely comprises a minority of
the genomic information in developmentally complex 
organisms. Most of the human genome — perhaps not 
unreasonably 99% — must be devoted to the regulation of 
development, generating ~30–40 trillion cells with
highly specific arrangements, characteristics and connections
in an adult capable of reproduction and parenting.
In mammals, the developmental symphony is largely 
conducted in utero and largely completed at puberty, with 
only reproduction, menopause and senescence to follow, 
and physiological homeostasis in between. 

A compounding problem is that most studies of the reg- 
ulation of gene expression in humans and other mammals 
have been carried out with cultured cells, often transformed 
or derived from tumors, and maintained in artificial envi- 
ronments divorced from developmental processes. This is, 
however, changing with the arrival of increasingly sophis- 
ticated systems for in vitro embryonic development beyond 
gastrulation"-'* involving four-dimensional organoid cul- 
tures??? capable of forming (albeit imperfectly) almost 
any type of complex tissue” — such as lung,” liver,” intes- 
tine and colon,?*2% kidney,” retina?’ and brain?*? — with 
well-defined signaling environments and development 
recapitulation protocols, as well as strategies for ‘program
unlocking’ with cell de-differentiation, differentiation and 
trans-differentiation. Cerebral organoids derived from 
human, gorilla and chimpanzee cells have been used to 
study developmental mechanisms? driving brain expan- 
sion, although there are still major limitations to these 
systems and barriers to be overcome.? 


TISSUE ARCHITECTURE AND CELL IDENTITY 


It is commonly asserted that humans have ~200 differ- 
ent cell “types” — muscle cells, bone cells (osteoblasts and 
osteoclasts), neurons, astrocytes, keratinocytes, fibroblasts,
hepatocytes, etc. (see, e.g., ^). However, this is a
gross underestimate of the range of spatial identities of 
cells that follow different and highly precise trajectories 
during development, as Waddington envisaged.^ The 
differential expression of Hox genes gives some insight 
into the positional specificity of superficially similar 


* With ethical implications and limitations increasingly more promi- 
nent than technical constraints.?! 

d This study found that ZEB2 is an evolutionary regulator of the key 
delayed morphological transition with shorter cell cycles underly- 
ing human brain organoid expansion.*+ ZEB2 is a highly conserved 
“master transcription factor” involved in the epithelial-mesenchymal 
transition (EMT), which has been under positive selection pressure 
in the primate lineage? and is regulated by non-coding RNAs,?- as 
are other aspects of EMT?*-"! (Chapter 16). 
cells: there are 118 subtypes of neurons in C. elegans
(which has only 302 neurons), each of which expresses a
different combination of Hox genes.* Human skin fibro- 
blasts express different HOX genes in different parts of 
the body, reflecting their “positional identities relative to 
major anatomic axes”, which are cell autonomous and 
epigenetically controlled.+748 

Although the transcriptional profile of only a small 
fraction has been examined, the increasing resolution of 
single-cell RNA sequencing and chromatin analyses is 
confirming the large range of cell states and identities at 
different developmental stages, and in different regions 
of the brain,??59 revealing the differential expression not 
only of transcription factors, Hox genes and other regula- 
tory proteins, but also the highly restricted expression 
of enhancers,? lncRNAs and TE-derived RNAs.5760-68

An example is the expression of the imprinted maternal
allele-expressed lncRNA Meg3/Gtl2 during the
embryonic development of the inner ear, which is “local- 
ized to the spiral ganglion, stria vascularis, Reissner's 
membrane, and greater epithelial ridge in the cochlear 
duct", leading to the conclusion that “Meg3/Gtl2 RNA 
functions as a noncoding regulatory RNA ... that plays 
a role in pattern specification and differentiation of cells 
during otocyst development, as well as in the mainte- 
nance of a number of terminally differentiated cochlear 
cell types”. 

Human cortical pyramidal neurons exhibit heteroge- 
neity in their electrophysiological properties.” The liver, 
kidney, lungs, cortex, hippocampus, eyes, tongue, ovaries 
and testes all have different but specific designs. Each 
muscle is comprised of similar cell types, but each has a 
different architecture (Figure 15.1), in which the compo- 
nent cells must have highly precise positional placement 
and identity. 

Bones in particular, while comprised of the same 
types of cells, exhibit extraordinary diversity and preci- 
sion of structure and architecture. Each vertebra in the 
spine has a distinct and characteristic shape with complex 
grooves and projections. The long bones in ribs, arms, 
hands, legs and feet, the short cuboidal bones in the wrist 
and ankle, the flat but still precisely curved bones of the 
face, cranium, shoulder blades, sternum and pelvis, and 
the exquisitely fine ossicles of the inner ear, all have spe- 
cific irregular shapes. Moreover, different bones have dif- 
ferent internal structures — dense lamellae, honeycombed 


* Single-cell RNA sequencing is also allowing the investigation of the 
specialization and divergence of cell types across a range of evolu- 
tionary distances,*%% revealing greater complexity of gene expres- 
sion and regulation in humans relative to mouse.?! 
trabeculae, or marrow, adapted to different weight-bearing,
torsional or other functions, with a legion of tendons
and muscles to match. 

Equivalent bones in other species are different from 
those of humans, but just as idiosyncratically and precisely
sculpted (Figure 15.2). The precision of this architecture 
is likely to require much more, and far more precise, 
information than simply the differential expression of 
Hox proteins or relatively generic transcription factors 
(TFs) such as the muscle-specific MyoD (see below).

The challenge for mammalian development is to 
direct the ontogeny, architecture, arrangement, and 
interconnections of each of the bones, muscles, and 
organs from a single fertilized cell to a functional adult. 
Cell state and fate have to be computed and specified by
the genome at every stage, tuned by cell-cell interac- 
tions and feedback loops (involving intercellular ligand- 
receptor signaling, cell contact area detection? and 
other forms of communication, including bioelectric 
circuits and mechanical forces/^79), without which the 
FIGURE 15.1 The exquisite patterning of human muscles. (Image: Shutterstock.)


endogenous program would be degraded by stochastic 
errors.! 


PROGRAMMED ONTOGENY 


While embryogenesis has been studied in many animals, 
in only one has the ontogeny of all cells in the adult been 
characterized to date.* This is the nematode C. elegans, 
where John Sulston and Robert Horvitz determined the 
progression from the fertilized egg to the 959 and 1,031 


£ A thought experiment is to imagine the difference between a robot 
programmed to build a motor vehicle that has no feedback or sensory 
mechanisms, versus one that does and can refine its actions accord- 
ingly. This explains why disruption of cell-cell signaling leads to 
aberrant development, why morphogenetic gradients of diffusible 
factors are employed and why sequential expression of Hox proteins 
is used to specify positional identity,” but does not mean that posi- 
tional information conveyed by cell-cell communication is the main 
mechanism of developmental control, as is often assumed.” 

= Cell divisions in the early stages of embryogenesis (up to gastrulation) 
have been documented in the ascidian Phallusia mammillata.57? 
FIGURE 15.2 The skeleton of the prehistoric cave bear, showing the unique architecture of each bone. (Image from the 1906
Annual Report of the Director to the Board of Trustees of the Field Museum of Natural History, Chicago.) 


cells that comprise the adult hermaphrodite and male,
respectively (Figure 15.3). The results show
that the ontogeny of C. elegans is tightly programmed 
and (mutations aside) invariant, which is required as the 
positions and capacities of every cell are critical to the 
functioning of the organism. 

Imagine the complexity of the equivalent graph for a 
human. 

Programmed ontogeny occurs with high precision in 
all animals, including humans, evidenced not only by the 
architectural reproducibility of the myriad of muscles, 
bones and organs but also by the fact that monozygotic 
twins are essentially phenocopies (Figure 15.4), with the 
same body and facial shape, dimples and hairlines, etc., 
notwithstanding a small number of post-zygotic muta- 
tions? and context-dependent clonal amplifications of 
such cells as adipocytes or lymphocytes.^ The ontogeny 
of a human must be as precise and hard-wired, or nearly 
so, as that of the worm, even if orders of magnitude more 
complex and multi-layered over many more cell divi- 
sions, which implies orders of magnitude more regula- 
tory information. The same must apply to carp, axolotls, 
crocodiles, iguanas, chickens, emus, eagles, echidnas, 
dogs, horses, whales, etc., each with idiosyncratic ontog- 
eny graphs. 

The phenotypic reproducibility means that there 
may be as many as ~10^13 distinct cell identities (as
opposed to cell ‘types’) in a human, and ~10^13
leaves (Waddington valleys) in the ontogeny graph,


compared to 10^3 in C. elegans. Every decision to divide 
or differentiate must be calculated with high precision.! 
Developmental reproducibility also shows that, if there 
is any “noise” in gene expression, as often claimed, such 
noise does not have significant impact on phenotype. 
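
As a rough check on these orders of magnitude, the following sketch computes the minimum number of rounds of division needed to reach a given number of leaves. It assumes an idealized, perfectly balanced binary lineage tree, which real ontogenies are not; the function name is illustrative only.

```python
def min_division_rounds(n_leaves: int) -> int:
    """Smallest depth d such that 2**d >= n_leaves (balanced binary tree).

    Uses integer arithmetic to avoid floating-point rounding.
    """
    if n_leaves < 1:
        raise ValueError("n_leaves must be positive")
    return (n_leaves - 1).bit_length()

# Leaf counts quoted in the text (orders of magnitude only).
worm_rounds = min_division_rounds(10**3)
human_rounds = min_division_rounds(10**13)
print(worm_rounds, human_rounds)
```

Under this simplification the human graph is only about four times deeper than that of the worm, yet it has some ten billion times as many leaves, each of which has to be specified.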

Indeed, it is likely that the systems that control devel- 
opment have evolved to suppress noise,%%- as happened 
in the transition from analog to digital computers.?? 
Developmental programming must also be extremely 
robust, to ensure that enough progeny develop to repro- 
ductive capacity, especially in birds and mammals, 
which have limited numbers of offspring. Indeed such 
robustness appears, in part, to be provided by conser- 
vation and resilience of TF networks as well as by 
miRNAs and siRNAs?5-?? and ‘shadow’ enhancers,?8-102 
which provide yet more examples of the non-random 
exaptation and non-neutral evolution of transposable 
elements.90.103.104 

Therefore, most of the genomic information required 
for development is that which orchestrates cell lineages, 
cell fates and tissue architectures: the feed-forward programming
of decisions at each stage that specify when
and in what plane a cell divides, the developmental trajectory
it is in (for example, mesodermal or ectodermal cell
types) and when it ceases to divide and enters terminal
differentiation to form a myoblast, fibroblast, osteocyte
or neuron. In addition to cell fate trajectories there must
be other information that determines how a cell is posi- 
tioned in relation to adjacent cells and, in the nervous 


^ The phenotypic diversity of humans (leaving aside body weight) 
must therefore be a function of variations in genome sequences, 
which, on average, differ by 0.1%.** 


! Embryonic ontogeny requires establishment of asymmetry in the 
zygote, which is achieved in different ways in different lineages 9-5? 
and asymmetry of cell division generally.5* 
FIGURE 15.4 The precision of human ontogeny: monozygotic twins. (Images: Shutterstock.)

system, with which other cells synaptic connections are
made. 

The organizing centers of cell division in animals 
are the specialized ribonucleoprotein organelles called 
centrosomes (see below). They are associated with the 
eukaryotic centromere, a (usually) central region of 
highly specialized chromatin that are also associated 
with internal granules called 'centrioles', which attach to 
kinetochores for spindle formation and chromatid pair- 
ing and separation to daughter cells during mitosis and 
meiosis.'% Centromeres form phase-separated domains!%
comprised of long tandem arrays derived from retrotrans- 
posons,'% contain specialized histones, express non-cod- 
ing RNAs, and are epigenetically controlled by complex 
networks of non-coding RNAs and the RNA interfer- 
ence pathway.!'%-123 In animal cells, centromeres are 
attached to the centrosomes, which are not essential 
for spindle formation or chromosome segregation but, 
importantly, contain cell cycle regulators and signal- 
ing molecules, many of which are post-translationally 
modified in ‘intrinsically disordered regions’ (Chapter 
16). Centrosomes are connected to and direct the move- 
ments of microtubules and other cytoskeletal structures 
and proteins, including primary cilia,? to control the 
spatial organization of cells, cell division, cell polarity 
and cell migration, with developmental and neurologi- 
cal consequences. !*4-!76 

Centrosomes are essential for the spatially pre- 
cise execution of cell division. The animal centrosome 
evolved with the transition to complex multicellularity, 
as a hybrid organelle with the ability to act as a plasma 
membrane-associated primary cilium organizer and a 
juxtanuclear microtubule-organizing center, enabling the 
connection between and integration of extracellular and 
intracellular signals with cell-autonomous (developmen- 
tally programmed) information.!” 

Plants and animals also have many different types of TFs 
that act to license transcription during development, unlike 
bacteria (and unicellular eukaryotes) where TFs mostly regulate
genes that are expressed under particular environmental
conditions, such as the lac operon. Some eukaryotic TFs
are undoubtedly required for gene expression responses to 
physiological and environmental variables, but that is only 
a minor function in developmentally complex organisms, 
and it is likely that the mechanisms of transcriptional con- 
trol during animal and plant development are fundamen- 
tally different from those in bacteria.!28.122 

Cell differentiation is readily accomplished by turn- 
ing on stage-specific (stem cell, mesoderm, genital ridge, 
neural crest, muscle, etc.) TFs to express the repertoire 
of proteins required for the function of that cell type,
such as myosins in muscle, keratins in skin and synaptic
receptors in neurons. The differentiation state of cells can
also be altered, and developmental programs overruled 
or reversed in culture, by ectopic expression of master 
or ‘pioneer’ TFs, such as the conversion of fibroblasts to 
myoblasts (muscle cells) by MyoD!% or to stem cells! by 
the reprogramming ‘Yamanaka factors’, Oct3/4, Sox2, 
Klf4 and c-Myc.132,133

While this is superficially taken to indicate the pri- 
macy of transcription factors, MyoD and Yamanaka 
factors have little or no DNA-binding specificity, but 
rather change the differentiation state or reactivate the 
pluripotency program* by opening chromatin and/or 
recruiting chromatin modifying proteins, ?-? whereas 
developmentally precise anatomical patterns and cell- 
type specification are dictated in vivo by chromatin state, 
enhancers and regulatory RNAs!*-12 (Chapter 16). A 
compelling example is the patterning of the human face 
and brain. ?-? 

Plant development also needs to be programmed but, 
while exhibiting remarkable diversity and versatility, is 
not and does not need to be as precise as that in animals, 
which have strict design requirements for their mobility 
and (consequently) rapid signal processing (i.e., cognitive
ability). By contrast, plant development must have flexibility
to adapt its formation to a constrained location and
situation, such as available light and other environmental 
challenges, which may account for its more relaxed use of 
retrotransposons and polyploidy. 4-150 


LINEAGE SPECIFICATION 


The systems required for lineage specification began 
to appear in single-celled eukaryotes and basal multi- 
cellular eukaryotes, wherein non-coding RNAs were 
exapted to control cell fate decisions to generate special- 
ized forms in organisms such as yeasts!5!-15 and fungal 
pathogens, 59.57 the social amoeba Dictyostelium discoi- 
deum, 5.5? the ciliate Oxytricha trifallax!99-1? and pro- 
tozoan parasites, the latter notable for their paucity of 
conventional transcription factors.!% 

The life cycle transitions of Capsaspora owczarzaki, 
a unicellular relative of animals with the largest known 
protein-coding gene repertoire for transcriptional regu- 
lation, are associated with changing chromatin states, 
differential lncRNA expression and transcription


i Interestingly, reprogramming of fibroblasts to stem cells is also mod- 
ulated and (less efficiently) achieved by non-coding RNAs, presum- 
ably related to their role in epigenetic remodeling.'*! 

* Which also occurs at different points during development in vivo, 
such as during the formation of the cranial neural crest.!** 
factor network interconnections. Capsaspora, however,
lacks animal promoter types and its cis-regulatory
sites are small, proximal and lack signatures of animal 
enhancers, indicating "that the emergence of animal 
multicellularity was linked to a major shift in genome 
cis-regulatory complexity, most notably the appear- 
ance of distal enhancer regulation". Indeed enhanc- 
ers appear at the dawn of animal evolution,!66-165 and 
enhancers and complex epigenetic regulation of chro- 
matin structure are the defining features of developmen- 
tally complex organisms, along with, and likely driven 
by, the widespread exaptation of transposon-derived 
sequences. 

The foundational transition in animal development! is 
the maternal-zygotic transition (MZT), 5.6 wherein the 
ovum (by far the largest cell produced in most organ- 
isms) supports the growing embryo after fertilization by 
providing the resources required, including the initial 
signals derived from localized maternal proteins and 
transcripts!771%5 and some sperm RNAs."?!599 At all of 
the early stages, miRNAs and lncRNAs (in particular
TE-derived RNAs) play an important role, from the splitting
of the zygote into two cells and subsequent partitioning
into more cells, forming the morula, blastocyst and
the differentiation of the ESCs in the inner cell mass into 
different lineages.'”° Notably, the maternally loaded tran- 
scriptome shows more variability across species than that 
expressed by the zygote during its middle developmen- 
tal stages, and the maternal transcripts that are degraded 
during the MZT are less conserved than those that remain 
after the onset of zygotic genome activation (ZGA), supporting
the ‘hourglass model’ of early development.!/6151-55 The hourglass model
is also supported by the temporal expression patterns of 
miRNAs in Drosophila.!$5 


HOW MUCH INFORMATION IS REQUIRED? 


A world away, the Boolean network models of gene 
regulation put forward by Stuart Kauffman (Chapter 
5), based on the control of gene expression in bacteria, 
suggested that combinatorial control by 'transcription 
factors’ and other regulatory proteins would suffice to 


! Another important aspect of developmental programming and mor- 
phological innovation is the concept of ‘heterochrony’, modulation 
of developmental timing, in which regulation by non-coding RNAs 
figures prominently.!6?-17 

? The hourglass model of embryonic evolution predicts an hourglass- 
like divergence of gene expression during animal embryogenesis — 
with embryos being more divergent at the earliest and latest stages 
of development but conserved during a mid-embryonic (phylo- 
typic) period that serves as a source of the basic body plan within a 
phylum. 81-182 
regulate complex developmental processes.186-192 This
was supported by Eric Davidson’s analysis of transcrip- 
tion factor expression and gene regulatory networks 
(GRNs) in sea urchin development,!%-1% as well as by 
others,?! for example, in the regulation of Drosophila 
body plan,!” plant flower development!” and the yeast 
cell cycle.199.200 

In bacteria, most transcription factors are repressors??! 
and Boolean logic gates such as AND, OR and NAND 
can be accomplished at promoters by simple combina- 
tions of interactions between just two transcription fac- 
tors,?02.203 such as observed in the lac operon, although 
it is possible to have more than two transcription factors 
and associated binding sites that act independently on 
an adjacent promoter. Consequently, a common presumption
has been that promoters in complex organisms
can be targeted by multiple transcription factors (and 
it is clear that this is the case?20%-207), and that an indi- 
vidual transcription factor can address many promoters 
differentially in different cells. How such discrimina- 
tion is achieved is unclear, but posited to be enabled by 
site "accessibility",2062052!9 presumably a function of 
epigenetic processes controlling chromatin structure.?0? 
Similarly, particular miRNAs can target the 3'UTRs of 
many mRNAs and many mRNAs can be targeted by 
many different miRNAs, although the decisional logic of 
such multilateral networks is unclear. 
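
The two-input promoter logic described above for bacteria can be written out as a toy truth table. This is a minimal sketch: encoding 'factor bound' as True, and the names `GATES` and `truth_table`, are assumptions made for illustration and do not model any particular promoter.

```python
from itertools import product

# Output True means 'gene transcribed'; inputs are two transcription
# factors, each either bound (True) or not bound (False).
GATES = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "NAND": lambda a, b: not (a and b),
}

def truth_table(gate: str) -> dict:
    """Return {(tf1_bound, tf2_bound): transcribed} for the named gate."""
    fn = GATES[gate]
    return {(a, b): fn(a, b) for a, b in product([False, True], repeat=2)}

for name in GATES:
    print(name, truth_table(name))
```

Each gate needs only four rows, which is why two factors suffice at a single promoter. The open question raised in the text is how such local decisions scale when one factor addresses many promoters differently in different cells.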

The theoretical foundations of GRNs can be traced 
to Ptashne's studies in the 1960s of the bistable switch 
that controls the lysogenic and lytic states of lambda 
phage in E. coli.?!'?4 The underlying assumption in 
GRN models of development is that transcription fac- 
tors play largely invariant roles in the GRNs in which 
they function.?? 

This assumption can accommodate alternative splice 
variants for increased complexity but does not accom- 
modate the fact that most eukaryotic transcription 
factors and other proteins involved in developmental 
regulation contain extensive intrinsically disordered 
regions, which are major sites of post-translational 
modifications?^?" (Chapter 16). The presence of IDRs 
in eukaryotic regulatory proteins confers “promiscuity”,215,218,219
i.e., flexibility in binding partners, and
implies, as does the cell-type variation in their binding 
sites,” that the recognition of (or access to) cis-regu- 
latory elements by TFs during development is context- 
dependent, i.e., requires additional information, leading 
to the conclusion that “developmental determination by 
GRNs alone is untenable"?" 

The response to the surprising lack of increase in the 
numbers of genes encoding transcription factors and other 
regulatory proteins between nematodes and humans has
been to assert that the power of ‘combinatorial’ control is 
sufficient to enable “a dramatic expansion in regulatory
complexity”,??!22 which, although not formalized mathematically
nor defined mechanistically, implies a superlinear
relationship between the number of regulatory
factors and the number of regulatable events. Moreover, 
in this conception, the developmental programming of 
more complex organisms is assumed simply to involve 
an expansion in the numbers of cis-acting sequences in 
DNA (and RNA) recognized by regulatory proteins, ?!2?? 
which, along with alterations in the expression of these 
proteins and the networks in which they participate, is 
posited to lie at the heart of the development and pheno- 
typic diversity of animals and plants.!92.196.223-230 

In trying to frame the characteristics of such networks 
in abstract terms, many theoretical gene regulatory mod- 
els refer to ‘attractors’, defined as ‘stationary (i.e., stable) 
network states that represent biologically meaningful 
properties, such as cell identity',2002512? which in humans 
presumably number in the trillions. 

The potential number of states in random Boolean 
networks is 2^R where R is the number of regulatory 
variables, and the number of attractor states scales superlinearly
(R^x for x ≥ 1). It is mechanistically implausible
that any genetic event can be an arbitrarily complex
Boolean function of other events. Kauffman recognized 
this and constrained his modeling to a maximum number 
of inputs, initially set at two.?*? It was later shown that 
the number of attractor states in Kauffman networks is a 
superpolynomial function of system size.?? Others have 
suggested that combinatorial control by transcription fac- 
tors might be equivalent to a class of computing called a 
Boltzmann machine,?* although the most plausible sub- 
strate for cellular computing is RNA 2? 
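
To make the notions of state space and attractor concrete, the sketch below enumerates the attractors of a very small Boolean network in which every node has two inputs, as in Kauffman's constrained models. The three update rules are hypothetical, chosen only for illustration, and synchronous updating of all nodes is an assumption of this sketch.

```python
from itertools import product

def find_attractors(update, n_nodes):
    """Enumerate attractors (fixed points and cycles) of a synchronous
    Boolean network by following every one of the 2**n_nodes start states."""
    attractors = set()
    for start in product((0, 1), repeat=n_nodes):
        visited = set()
        state = start
        while state not in visited:      # follow the trajectory
            visited.add(state)
            state = update(state)
        # The first revisited state lies on the cycle; walk it once.
        cycle = [state]
        nxt = update(state)
        while nxt != state:
            cycle.append(nxt)
            nxt = update(nxt)
        attractors.add(frozenset(cycle))
    return attractors

def toy_update(state):
    """Hypothetical 3-node network, two inputs per node (K = 2)."""
    a, b, c = state
    return (b & c, a | c, a ^ b)

toy_attractors = find_attractors(toy_update, 3)
print(len(toy_attractors), "attractors among", 2**3, "states")
```

In this toy case the eight possible states collapse onto two attractors, a fixed point and a two-state cycle. Exhaustive enumeration of this kind is feasible only for very small R, because the state space doubles with every added regulatory variable.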

In any case, while there is little doubt that expansion 
of the regulatory superstructure underpins the emergence 
and divergence of developmentally complex organisms, 
the implied logic of ‘combinatorial control’ is that the 
number of possible cell states increases exponentially or 
factorially with the number of regulatory proteins. In animals
this number is so vast (1000 regulatory proteins is
potentially 2^1000 Boolean states or 1000! combinations,
both of which are astronomically large numbers, far
greater than the estimated number of atoms in the observable
universe, ~10^80)* as to be more than capable,


n Although we now know that there has been a massive expansion 
in the number of regulatory RNAs with increased developmental 
complexity. 

? The ‘Eddington Number”.2% 
even if heavily discounted, of providing sufficient power
to program the development of a worm or a human, so no 
need for further concern or consideration regarding the 
constraints of regulatory proteins on complexity. 
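
The sizes of the numbers invoked in this argument can be checked directly with exact integer arithmetic. The sketch takes roughly 10^80 as the commonly quoted order-of-magnitude estimate for the number of atoms in the observable universe, which is an assumption here; the variable names are illustrative.

```python
import math

boolean_states = 2**1000               # on/off states of 1000 regulators
orderings = math.factorial(1000)       # 1000! orderings of 1000 regulators
atoms_in_universe = 10**80             # assumed order-of-magnitude estimate

digits_states = len(str(boolean_states))
digits_orderings = len(str(orderings))
print(digits_states, digits_orderings)
print(boolean_states > atoms_in_universe, orderings > boolean_states)
```

Both quantities exceed the atom count by hundreds or thousands of orders of magnitude, which is why the combinatorial argument appears to make any informational constraint irrelevant.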

While it is clear that many factors can influence a 
decision to transcribe a gene, so there is some sort of mul- 
tifactorial control,2%! the use of the term ‘combinatorial’ 
is ambiguous. It is by no means clear that regulatory fac- 
tors operate combinatorically? at the point of decision,‘ as 
opposed to the hierarchical, binary and sequential con- 
trol of gene expression in the temporal (developmental) 
sense,2%%2% nor that the amount of required regulatory 
information scales as an inverse factorial as implicitly 
assumed by the former. Indeed, the assumption has never 
been clearly articulated, nor (consequently) been justified 
or formalized mathematically by reference to decision 
theory or control system theory, or mechanistically; nor 
has it been subjected to critical scrutiny, as the assump- 
tion has sat comfortably with mainstream preconceptions. 

Decisional control systems are usually structured as 
binary hierarchies, as are developmental ontogenies, 
which is not inconsistent with sequential Boolean control. 
Sorting algorithms in binary trees scale as NlogN, where 
N is the total number of states (branches) in the ontogeny 
tree. If a functional network maintains the same proportional
connectivity (i.e., where N nodes are connected to
the same fraction of other nodes in the network), the number
of connections, presumably transacted by regulatory
factors, scales as N^2. Alternatively, the number of decisions
in an ontogeny tree scales as a geometric function
[2^(N+1) - 1] (~2^N) of the number of branches, whose
exact number may depend on the length of the branches 
but, in any case, is a very large number. 
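
The three growth laws named above (NlogN, N^2 and a geometric function of the form 2^(N+1) - 1) can be compared numerically. In the sketch below, the geometric form is checked against the node count of a complete binary tree; reading N as the depth of such a tree is an assumption made only for this illustration, and all function names are hypothetical.

```python
import math

def n_log_n(n: int) -> float:
    return n * math.log2(n)

def quadratic(n: int) -> int:
    return n**2

def geometric(n: int) -> int:
    return 2**(n + 1) - 1

def nodes_in_complete_binary_tree(depth: int) -> int:
    """Sum of 2**k over levels k = 0..depth, counted level by level."""
    return sum(2**k for k in range(depth + 1))

for n in (8, 16, 32, 64):
    print(n, round(n_log_n(n)), quadratic(n), geometric(n))
```

The first two laws remain within a few thousand at N = 64, whereas the geometric form already exceeds 10^19, which illustrates how strongly the conclusion depends on which scaling applies.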

How does the amount of regulatory information scale with
organismal complexity? The empirical evidence suggests 
that it is the opposite to that which is commonly assumed. 
Prokaryotic genomes are predominantly composed of pro- 
tein-coding sequences, and therefore it is possible to do a 
first approximation analysis of the relationship between the 
numbers of genes encoding regulatory factors (ignoring the 
small number of regulatory RNAs) and the total numbers of 
genes in cells of different genetic complexity. 


p Although transcription factor pairings can alter their target
specificity.237258 

q That is, the regulation of the expression of individual genes at
particular levels. This conception of combinatorial control implies a
‘voting function’ at regulatory elements in the gene promoter, and other
sites of regulatory action. Indeed, such a voting function breaks down
into Go (activate), No-Go (repress) and Tie-breaker, after which any
more inputs are redundant, which fits with the observed two- or
three-factor transcription factor control of bacterial gene expression.
FIGURE 15.5  Quasi-quadratic relationship between genes encoding regulatory proteins and total numbers of protein-coding
genes in 683 prokaryotic genomes.?? The analysis was computationally assessed by comparing all predicted proteins with Pfam
annotated regulatory domains (such as DNA-binding domains) with all annotated open reading frames. The inset is the linear
relationship. The main plot is a log-log graph of the data. The line of best fit has a slope of 1.96 (R ∝ N^1.96). (Analysis by Larry
Croft, reproduced with permission.)


Consistent with an NlogN or N^2 relationship, the number
of regulatory proteins R (as assessed computationally
by the presence of DNA and RNA binding domains) in
bacterial genomes increases quasi-quadratically with gene
number G: R = O(G^1.96) 9?" (Figure 15.5).
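
The exponent of such a power law is conventionally estimated as the slope of a straight line fitted to the log-log data, as in the main plot of Figure 15.5. The sketch below shows that calculation with ordinary least squares. The gene counts, the prefactor and the function name `fit_power_law` are hypothetical and the data are synthetic; none of them come from the 683-genome analysis.

```python
import math


def fit_power_law(gene_counts, regulator_counts):
    """Least-squares fit of log10(R) = a * log10(G) + b.

    Returns (a, 10**b), i.e. the exponent and prefactor of R = c * G^a.
    """
    xs = [math.log10(g) for g in gene_counts]
    ys = [math.log10(r) for r in regulator_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, 10 ** intercept


# Synthetic, noise-free illustration (hypothetical values, not real genomes).
genes = [500, 1000, 2000, 4000, 8000]
regulators = [2.5e-6 * g ** 1.96 for g in genes]

exponent, prefactor = fit_power_law(genes, regulators)

if __name__ == "__main__":
    print(f"fitted exponent: {exponent:.2f}")
```

With real data the points scatter around the line, so the fitted slope carries an uncertainty that this sketch does not estimate.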


CONSTRAINTS IMPOSED BY 
THE SUPERLINEAR SCALING OF 
REGULATORY INFORMATION 


The superlinear scaling of regulatory information has 
many implications for the evolution and development of 


r It is difficult to distinguish NlogN from N^2 over two orders of
magnitude of bacterial genome sizes.


organisms, and indeed all functionally integrated com- 
plex systems.2^24824? Since the number of genes involved 
in gene regulation scales roughly twice as fast as the total 
number of genes, and there is no hint of any deviation at 
the top end of the range, there must be a limit (as the num- 
ber of regulatory genes cannot exceed the total), which is 
ostensibly (and likely explains) the observed upper limit 
of bacterial genome sizes (about 10 Mb, ~9000 genes), 
where over 20% of the genes are regulatory.’ This limit 
was also likely reached early in evolution.” 
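
The ceiling argument can be made concrete with a rough extrapolation. The sketch below assumes that the power law R ∝ G^1.96 holds without deviation and calibrates it at the figures quoted above (about 20% regulatory genes at about 9000 genes). Both the calibration and the function names are illustrative assumptions, not part of the original analysis.

```python
def regulatory_fraction(genes, ref_genes=9000, ref_fraction=0.20, exponent=1.96):
    """Fraction of genes that are regulatory if R scales as genes**exponent.

    R/G then scales as genes**(exponent - 1), calibrated so that the
    fraction equals ref_fraction at ref_genes (assumed calibration point).
    """
    return ref_fraction * (genes / ref_genes) ** (exponent - 1)


def genes_at_fraction(target, ref_genes=9000, ref_fraction=0.20, exponent=1.96):
    """Gene number at which the regulatory fraction would reach `target`."""
    return ref_genes * (target / ref_fraction) ** (1 / (exponent - 1))


if __name__ == "__main__":
    for g in (1000, 3000, 9000, 20000):
        print(f"{g:>6} genes -> {regulatory_fraction(g):.1%} regulatory")
    print(f"fraction would reach 100% near {genes_at_fraction(1.0):,.0f} genes")
```

Under these assumptions the regulatory fraction grows almost linearly with gene number, so the absolute ceiling (every gene regulatory) lies only several-fold above the observed bacterial maximum. The practical limit is presumably reached well before that point.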

There is no reason to think that eukaryotes can escape or have
escaped the greater than linear scaling of regulatory factors


s Presumably the point where the added benefit of increased genetic
complexity is offset by the disproportionally higher burden of regu- 
latory overhead. 
with complexity, and the number of transcription factors
in eukaryotes also exhibits a power law relationship with
the number of protein-coding genes, albeit with a lower
exponent [R = O(N^1.23)], although (and likely because)
it is presently impossible to account for the number of genes
encoding regulatory RNAs in eukaryotic genomes.?»025!

In any case, to be able to realize the benefits and eco- 
logical opportunities of multicellularity, complex eukary- 
otes must have found solutions to the scaling problem of 
constructing elaborate decision trees in developmental 
programming. One is to introduce additional layers of 
regulation and added flexibility for regulatory evolution, 
which has clearly occurred, in respect of chromatin state, 
epigenetic and other post-translational modifications, 
enhancers, lncRNAs and miRNAs. The second is modularity,
enabling the reuse of subroutines where possible.
The third is to separate the regulatory signal from the 
analog functions directed by it, to use the infrastructure 
most efficiently. The fourth is compartmentalization, 
which appears to be far more widespread and sophisti- 
cated than realized from simple ultrastructural images of 
cells, particularly the partitioning of the cell and chroma- 
tin into phase-separated domains that have the potential 
to focus interactions and reduce noise and cross-talk in 
gene regulatory transactions?" (see next chapter). This is 
fertile ground for both theoretical and mechanistic stud- 
ies of how autopoietic programs are best designed and 
executed. 

If prokaryotes and eukaryotes are subject to the 
same laws of regulatory scaling, as the developmen- 
tal complexity of organisms' increases, the fraction of 
the genome devoted to regulation must increase, which 
would, in principle, explain the genomic expansion and 
increasing domination of non-protein-coding sequences 
across orders of magnitude of developmental complexity. 


HOW MUCH INFORMATION IS 
THERE IN THE HUMAN GENOME? 


It is worth reflecting on the amount of information that is 
stored in the genomes of humans and other animals. In 
computational terms, the haploid human genome contains 
around 6.6 gigabits of data (after conversion of 3.3 billion 
AGCT nucleotides to binary characters — 00,01,10,11) or 
825 megabytes, less than required for the storage of a few 
hundred images, and far less than the capacity of a smart 
phone. Put in these terms, it seems incredible that such a 
compact suite of data can program the development, phys- 
iology, cognitive capacity and reproduction of a human. 


t Metabolic, differentiation, developmental or cognitive complexity. 
Indeed, it is likely that genomic information is dense in
complex organisms, as it is in bacteria and viruses.
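
The arithmetic behind these figures is short enough to check directly. In the sketch below, the assignment of the four two-bit codes to particular bases is an arbitrary assumption (the text lists the codes but not which base takes which), and the helper names are illustrative.

```python
# Two bits suffice for a four-letter alphabet. The mapping is arbitrary.
TWO_BIT_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode(sequence):
    """Return the two-bit binary encoding of a DNA sequence as a string."""
    return "".join(TWO_BIT_CODE[base] for base in sequence.upper())


def genome_information(base_pairs, bits_per_base=2):
    """Return (bits, bytes) needed to store a genome of the given length."""
    bits = base_pairs * bits_per_base
    return bits, bits // 8


if __name__ == "__main__":
    bits, n_bytes = genome_information(3_300_000_000)
    print(f"{bits / 1e9:.1f} gigabits, {n_bytes / 1e6:.0f} megabytes")
    print(encode("ACGT"))
```

This counts raw storage only. It says nothing about how much of the sequence is functional or how compressible it is.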


GENOMES AS .ZIP FILES OF 
TRANSCRIPTOMES 


We suggest that not only is the vast majority of the 
human genome devoted to the regulation of development — 
it is not junk — but also that the necessary huge expan- 
sion in regulatory information could only be achieved by 
separating regulatory signals from consequent actions 
(aided by the emergence of the nucleus and the physi- 
cal separation of transcription and translation in eukary- 
otes — Chapter 4), using RNA to direct generic effector 
proteins to specific locations in the genome and in other 
RNAs, exemplified in simple form by the RNAi and 
CRISPR systems. Indeed, the ability to target a regula- 
tory action to different sites using RNA guides, either 
small RNAs or modules within lncRNAs (Chapter 16),
constitutes a large increase in regulatory power at mini- 
mal cost, at the same time also enabling enormous flex- 
ibility in developmental programming and evolutionary 
adaptation. 

Chiara Alberti and Luisa Cochella summarized it well 
with respect to miRNAs (which can be applied equally to 
lncRNAs - our addition):


we present a view of miRNAs (and lncRNAs) in
the context of development as a hierarchical and
canalized series of gene regulatory networks.
In this scheme, only a fraction of embryonic
miRNAs [and lncRNAs] act at the top of this
hierarchy, with their loss resulting in broad
developmental defects, whereas most other miRNAs
[and lncRNAs] are expressed with high cellular
specificity and play roles at the periphery of
development, affecting the terminal features of
specialized cells.2%


Finally, we suggest that genomes are better viewed not 
as repositories of protein-coding genes but highly com- 
pacted transcriptomes that are unzipped during devel- 
opment. Indeed, as we show in the next chapter, the 
production of an army of small RNAs and lncRNAs,
interacting with an expanded repertoire of effector pro- 
teins, lies at the heart of all regulatory and organiza- 
tional adaptations that allowed eukaryotes to access and 
explore the dimensions of multicellularity and develop- 
mental complexity, and regulate every aspect of their 
four-dimensional ontogeny. These innovations comprise 
a major addition to the nature of genetic information, at 
the same time increasing the robustness and evolvability 
of these systems. 




FURTHER READING

Maturana H.R. and Varela F.J. (1972) Autopoiesis and
Cognition: The Realization of the Living (D. Reidel
Publishing Company, Dordrecht).
Salthe S. (1993) Development and Evolution: Complexity and
Change in Biology (MIT Press, Cambridge, MA).
Edelman GM (1978) The Mindful Brain: Cortical Organization
and the Group-selective Theory of Higher Brain Function
(MIT Press).
Fedoroff NV (2012) Transposable elements, epigenetics, and
genome evolution. Science 338: 758-67.
Erwin DH and Davidson EH (2009) The evolution of hierarchical
gene regulatory networks. Nature Reviews Genetics
10: 141-8.
Alberti C and Cochella L (2017) A framework for understanding
the roles of miRNAs in animal development.
Development 144: 2548-59.




16 RNA Rules


RNA IS A CORE COMPONENT 
OF CHROMATIN 


DNA and proteins have been the focus of the study of 
chromosome structure, but RNA is also a major com- 
ponent, essential to the organization of chromatin and 
the ‘nuclear matrix'.'-? As long ago as 1989, Sheldon 
Penman and colleagues demonstrated that transcription 
is required to maintain nuclear structure, and that chro- 
matin integrity is destroyed by treatment with RNase, 
noting that “ribonucleoprotein granules were dispersed 
throughout the euchromatic regions" and suggesting 
"that RNA is a structural component of the nuclear 
matrix, which in turn may organize the higher order 
structure of chromatin" (Chapter 4).

Genome-wide mapping and sequencing studies sub- 
sequently showed that there are many chromatin-bound 
RNAs in animal cells and that the locations of long non- 
coding RNAs in chromatin are “focal, sequence-specific 
and numerous”,’? with thousands of “tightly associated" 
non-coding RNAs tethered adjacent to active genes.!! 29?! 
Well-studied non-coding RNAs such as 7SK, U1, B2 and 
Alu RNAs, Gas5 and SRA, and more recently a large 
coterie of enhancer-derived and other lncRNAs, have
been shown to be involved in the regulation of transcrip- 
tion initiation, elongation, termination and splicing.??~*> 
The stress response induces the transcription downstream
of protein-coding genes of thousands of lncRNAs
that remain chromatin bound.”® Chromatin-associated 
RNAs, which include those transcribed from enhancers 
and repeats, have been shown to have roles in genome 
organization via enhancer-promoter interactions and 
the formation of transcription hubs, heterochromatin 
and nuclear bodies (or 'granules')!- 95.162127? through 
their interaction with proteins containing intrinsically 
disordered regions and the formation of phase-separated 
domains, as set out below. 


REGULATION OF CHROMOSOME 
STRUCTURE 


The scaffolding of euchromatin involves highly abundant
(‘CoT1’) repeat RNAs, predominantly from 5'
truncated LINE elements; *? the expression of which
varies during development and is regulated by other
RNAs.??* Chromatin-associated RNA proximity liga- 
tion reveals an RNA-DNA contact map similar to that 
observed by DNA-DNA ligation in topologically asso- 
ciated domains.? LncRNAs have been shown to regu- 
late TAD formation, and a recent analysis identified 
more than 10,000 RNA-chromatin interactions mediated 
by protein-coding RNAs and non-coding RNAs.** The 
RNAi machinery has also been shown to regulate nuclear 
topology.?940 

Many binding sites for CTCF, a zinc-finger contain- 
ing protein (see below) that appears to anchor boundary 
sequences in TADs*! (Chapter 14), are derived from trans- 
posable elements? and transcriptionally active HERV-H 
retrotransposons demarcate TADs in human pluripotent 
stem cells.? Similar to that observed with ‘enhancer’ 
RNAs (see below), lncRNAs have been reported to
regulate neighboring genes through interaction with the 
Mediator complex,^^^ a master coordinator of transcrip- 
tion and cell lineage commitment that also organizes 
chromosome topology (Chapter 14). 

LINE- and centromere-derived repeat RNAs are 
structural and functional components of centromeric 
chromatin.*%-% Heterochromatin formation generally 
requires the expression of repetitive sequences? and 
the RNAi pathway?!'^^ and RNA binding is required 
for heterochromatic localization of HP1 and the Suv39h
histone methyltransferase.?6? Chromatin compaction is 
also controlled by lncRNAs that target IAP retrotransposons.9?
Telomere formation and maintenance require
specialized non-coding RNAs,9^9? as does pairing of
homologous chromosomes in meiosis6%% and many, if
not most, chromatin-associated proteins bind RNA,9
including those involved in other chromatin-regulated
processes such as DNA stability and damage repair.


RNA GUIDANCE OF CHROMATIN 
REMODELING 


Chromatin structure is modulated during development by 
“pioneer transcription factors” that alter cell fate in plants 
and animals by targeting nucleosomes and/or common 
DNA motifs.9"?! The best known examples of reprogram- 
ming proteins are the ‘Yamanaka’ factors, Oct4 (Pou5f1 




gene), Sox2, Klf4 and c-Myc, which are (collectively)
capable of converting differentiated cells to ‘induced
pluripotent stem cells’ (iPSCs),7- a process enhanced by
inclusion of the RNA-binding protein, Lin28.? Oct4 is 
also involved, inter alia, in the differentiation of pluripo- 
tent cells to form the cranial neural crest.’ 

Another key pluripotency and reprogramming factor is 
Nanog, a homeobox-containing protein." Homeoboxes 
are helix-turn-helix DNA-binding domains that exhibit
a preference, but not specificity, for the common motif 
TAAT,9% in the case of Nanog TAAT(G/T)(G/T).*! 
Oct4 is also a homeobox-containing protein that recog- 
nizes the loose consensus sequence TTT(G/T)(G/C)(T/A) 
T(T/A), which occurs at thousands of sites around the 
genome.82-8 

The expression of Oct4, Nanog and other pluripotency 
factors? is regulated by non-coding RNAs,” including 
pseudogene-derived lncRNAs,%-% one of which recruits
the histone-lysine N-methyltransferase SUV39H1 to 
epigenetically silence Oct4 expression.?7?* Reciprocally,
Oct4 and Nanog regulate the expression of lncRNAs that
modulate pluripotency.” Oct4 and Nanog also have mul- 
tiple pseudogenes,!*-103 some of which are differentially 
expressed in pluripotent and tumor cell lines.!9?.104 

There are 16 classes of genes encoding homeo- 
box proteins in animals, 11 in plants, with hundreds of 
orthologs in the human genome, most of which contain 
additional domains.*? As noted already (Chapter 5), Hox 
proteins are ‘master controllers’ of gene expression pat- 
terns during animal and plant development, and regulate 
the expression of many genes at different developmental 
stages. While they recognize similar sequences in vitro, 
Hox proteins display wide functional diversity and iden- 
tification of their in vivo genomic targets has proven elu- 
sive, as has the identification of the targets of Oct4 and 
Sox2.50,83.105-109 Analysis of genome-wide DNase I hyper- 
sensitivity profiles and transcription factor (TF)-binding 
sites identified 120 and validated eight ‘pioneer’ TF fam- 
ilies that dynamically open chromatin (including Sox2, 
Oct4 and Hoxa11), and identified ‘settler’ TFs (including
c-Myc), and the nuclear hormone receptor RXR:RAR 
and NF-κB families, whose genomic binding is dependent
on chromatin opening by pioneer TFs.!!?

The targets of Sox2, Oct4, Nanog and other Hox 
proteins change with developmental stage/7^!!.!? all of 


a These factors have distinct roles in cell lineage specification" and 
the regulation of their expression is intertwined.7+85%55-% Nanog 
exerts its action in part via TET1/2 methylcytosine hydroxylases.?? 

b Noncoding RNAs also regulate the expression of the nuclear
hormone receptor ESR1 and CEBPA (CCAAT enhancer-binding
protein alpha).
which suggests that other factors are involved in determining
their locus specificity. In this context, it may not
be an outlier observation that the Drosophila Hox pro- 
tein Bicoid (which controls anterior-posterior patterning) 
binds RNA via its homeodomain,!?!'^ nor that highly 
conserved lncRNAs are produced in vertebrate endoderm
lineages from paralogous regions in HOXA and
HOXB clusters.! ^ 

Sox2 is a member of a subclass of ‘high mobility group’
(HMG) proteins, the most abundant chromatin-associated 
proteins after histones. HMG proteins bend DNA struc- 
ture, initiate chromatin opening and facilitate nucleo- 
some remodeling. 6-5 There are three classes, one of 
which (HMG-A) is abundant in embryonic cells and binds 
AT-rich sequences, another (HMG-N) binds nucleosomes, 
and the third (HMG-B, which includes the Sox proteins) 
binds the DNA helix minor groove with no sequence 
specificity.119120 Sox2 influences development not only in 
pluripotent stem cells but also in the lung, ear and eye, and 
in neural lineages, but how it and other HMG-B proteins 
achieve their tissue-specific versatility is unclear.!% 

While Sox2 has low affinity for DNA, it binds RNA 
with high affinity through its HMG domain, ^? as do 
other members of the HMG-B family,'2 “which requires 
a reassessment of how these proteins establish proper 
patterns of gene expression across the genome"?! The 
HMG-B domain of the mammalian sex-determining 
protein Sry is homologous to the RNA-binding domain 
of a viral protein,"^ suggesting that their target selection 
in vivo is guided by trans-acting RNA signals. There are 
well-documented examples of lncRNAs that interact with
Sox2 to regulate pluripotency, neurogenesis, neuronal
differentiation and brain development,!'22125-128 and a lncRNA
has been shown to interact with a chromatin-remodeling 
complex to induce nucleosome repositioning.!? 

Sox2, Nanog and Oct4 are often found at super 
enhancers! and state-specific differences in enhancer 
activity correspond with reconfiguration of Sox2, 
Nanog and Oct4 binding and target gene expression.!!! 
The lncRNA Evf2 selectively represses genes across
megabase distances by coupling recruitment and 
sequestration of Sox2 into phase-separated domains 
(see below), affecting enhancer targeting and activity, 
with genome-wide effects.'?? In human embryonic stem 
cells Oct4 and Nanog associate with transcripts of the 
human endogenous retrovirus subfamily H (HERV-H) 
transposable elements, which are required to maintain 
stem cell identity and whose terminal repeats function 
as enhancers.!31,132 

The classic master switch transcription factor, MyoD, 
which can reprogram fibroblasts into muscle cells and is 
central to muscle differentiation in vivo,! is regulated by
lncRNAs,'34-6 as are other aspects of muscle gene
expression.!37-132 The pioneer transcription factor CBP also
binds RNAs, including those transcribed from enhanc- 
ers, to stimulate histone acetylation and transcription. 

Nucleosome repositioning and remodeling are accomplished
by the ATP-dependent imitation switch (ISWI),
chromodomain helicase DNA-binding (CHD),
switch/‘sucrose non-fermentable’ (SWI/SNF) and
INO80 complexes.!*!142 These complexes are directed to
specific sites in chromatin or antagonized by lncRNAs,
including Xist and enhancer RNAs, in processes as 
diverse as rRNA synthesis, myogenic differentiation and 
proliferation, endothelial proliferation, migration and 
angiogenic function, atherosclerosis, cardiomyopathy, 
liver regeneration and stem cell renewal, immunity and 
inflammation, and various cancers,!2!36.143-160 leading 
one group to conclude that "every cell type expresses 
precise lncRNA signatures to control lineage-specific
regulatory programs "6 

However, the patchy data on the binding of RNAs 
by the various proteins that control chromatin remodel- 
ing during development reflects limited investigations 
because of the expectation that all “transcription factors” 
bind to DNA, rather than being directed by RNA-DNA and
other RNA-mediated interactions. 


GUIDANCE OF TRANSCRIPTION FACTORS 


Loose DNA sequence specificities are a feature of eukary- 
otic transcription factors generally. While eukaryotic 
genomes are orders of magnitude larger than those of pro- 
karyotes, their more conventional TFs have shorter DNA 
recognition sites (6-10 bp versus 15-25 bp in E. coli'*'),
often expressed as a ‘consensus’ sequence, but better rep- 
resented by multiple sequences, with many TFs recogniz- 
ing different primary and secondary motifs.162-166 
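
A back-of-the-envelope calculation illustrates why short recognition sites in large genomes cannot, on their own, specify unique targets. The sketch below assumes equal base frequencies, independence between positions and exact matching, none of which holds strictly in real genomes. The E. coli genome size of about 4.6 Mb is supplied here as an assumption, and the function name is illustrative.

```python
def expected_chance_sites(genome_bp, site_bp, both_strands=True):
    """Expected number of exact matches to one site_bp-long sequence
    in a random genome of genome_bp base pairs (uniform base composition)."""
    per_strand = genome_bp / 4 ** site_bp
    return 2 * per_strand if both_strands else per_strand


HUMAN_HAPLOID_BP = 3_300_000_000  # figure used earlier in the text
E_COLI_BP = 4_600_000             # assumed approximate genome size

if __name__ == "__main__":
    for k in (6, 10):
        n = expected_chance_sites(HUMAN_HAPLOID_BP, k)
        print(f"human, {k} bp site: ~{n:,.0f} chance matches")
    for k in (15, 25):
        n = expected_chance_sites(E_COLI_BP, k)
        print(f"E. coli, {k} bp site: ~{n:.2g} chance matches")
```

Under these assumptions a 6-10 bp site is expected to occur by chance thousands to millions of times in a human-sized genome, whereas a 15-25 bp site is expected less than once in a bacterial genome, consistent with the point that additional factors must help select eukaryotic TF-binding sites.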

Moreover, high-throughput chromatin immunoprecip- 
itation experiments with antibodies against specific TFs 
show different patterns of binding in different cell types, 
so additional factors must be involved. Such factors can 
be either (or both) trans-acting signals or chromatin 
accessibility, the latter supported by the observation that 
TF-binding sites are nucleosome depleted and DNase- 
sensitive, indicating that epigenomic decisions precede 
TF binding.!67-170

The largest class of TFs in animals and plants con- 
tain ‘zinc-finger’ (ZF) domains; specifically the C2H2 
class, of which there are over 700 encoded in the human
genome, and which recognize more sequence motifs 
than all other transcription factors combined.!”! Human 
C2H2-ZF proteins contain an average of ~10 C2H2 domains
(ranging from 1 to 30), classified into three groups:
‘triple’, ‘multiple-adjacent’, and ‘separated-paired’ C2H2
finger proteins, enabling some to bind multiple ligands. 

It is thought that most ZFs bind to DNA, although 
most of the binding sequences are unidentified,!© but 
many ZFs also bind to RNA or protein, and some to RNA 
only.!62.772.73 The classic example is TFIIIA (Chapter 8), 
which is required for the transcription of 5S rRNA genes 
and is titrated off DNA by its higher affinity for 5S 
rRNA, the first demonstration of the regulation of TFs
by RNAs.174175 

A large fraction of C2H2-ZF TFs have been shown to 
regulate alternative splicing."* A splice variant that intro- 
duces three additional amino acids (KTS) between the 
third and fourth ZFs of the Wilms' tumor protein WT1
changes the specificity of the WT1 protein from DNA
to spliceosomes,"* presumably by binding RNA, given 
that WT1 also contains an RNA recognition motif!”? and 
transcription and splicing are coupled.'®° Disturbance 
of the ratio of +/-KTS isoforms causes a developmental 
syndrome, affecting kidney and genital development.'*! 
Both isoforms bind DNA and RNA in vitro,3?-!*^ shuttle
between the nucleus and translating polysomes in the 
cytoplasm, and their subnuclear location is RNase- but 
not DNase-sensitive.!8? 

A 1994 analysis by Yigong Shi and Jeremy Berg of 
two representative C2H2-ZF proteins, one of which was 
Sp1 (which controls the expression of many housekeeping,
tissue-specific, cell cycle and signaling pathway
response genes!8), showed that they have a higher affin- 
ity for RNA-DNA hybrids than for double-stranded DNA 
and that this increased affinity was strand-specific, i.e., 
dependent on which strand is RNA.!?7 

The C2H2-ZF transcription factor YY1, which regulates
the expression of various genes during embryogenesis,
cell differentiation and proliferation,5* binds
chromatin in an RNA-enhanced fashion!“ and appears 
to play a major role in mediating enhancer-promoter 
loops.?? YY1 also interacts with an RNA-binding pro- 
tein involved in splicing regulation, depletion of which 
attenuates YY1 chromatin binding and YY1-dependent
DNA looping and transcription. The ZF-containing 
TAD insulator CTCF has also been shown to be a high- 
affinity RNA-binding protein.65191-19%5 


c So-called because they have a domain shaped like a finger that is
structured by a coordinated zinc ion. 


d Frequently mutated in pediatric kidney tumors and urinogenitary 
developmental disorders."? 
Later studies confirmed that over 800 human proteins
bind RNA-DNA hybrids and over 300 prefer binding 
RNA-DNA hybrids over dsDNA.!% These observations 
raise the possibility, if not the likelihood, that trans- 
acting RNAs are involved in the exposure and selection 
of genomic TF-binding sites, explaining the differential 
locus specificity of TF binding and the reason for a loose 
consensus sequence, as well as enabling directionality of 
action by strand selection. 

RNA-DNA hybrids! (which form ‘R-loops’ with the 
displaced DNA strand) occur widely throughout the 
human genome??5/5? and even encompass 8% of the yeast 
genome.?? RNA-DNA hybrids are enriched at unmeth- 
ylated CpG-rich promoters, transcription start sites and 
regions enriched for activating histone modifications such 
as H3K4me1/2/3, H3K9ac and H3K27ac.20 RNA-DNA
hybrids regulate genome stability and DNA repair, ?7202.203 
promoter-proximal chromatin architecture and cellular 
differentiation,” transcriptional activation? and “are 
enriched at loci with ... potential transcriptional regu- 
latory properties ... supporting a model of certain tran- 
scription factors binding preferentially to the RNA:DNA 
conformation”.2% The formation and stability of RNA- 
DNA hybrids are in turn regulated by RNA methylation 
and other modifications!” (Chapter 17). 

Nucleic acid triplex structures? (wherein a single 
stranded DNA or RNA forms ‘Hoogsteen’ hydrogen 
bonds with the purine-rich strand of polypyrimidine- 
polypurine tracts in the major groove of duplex DNA) 
also occur in vivo, as first detected by the sequence- 
specific binding of RNA to ‘native’ dsDNA ?'^ Triplex- 
forming sequences are overrepresented in eukaryotic but 
not bacterial genomes, notably in regulatory regions and 
promoters.?15217 Antibody and sequencing studies have 
also shown that triplex structures abound in eukaryotic 
chromosomes?'5?! (Figure 16.1). Triplex hotspots targeted
by lncRNAs have been proposed to contribute to
chromatin compartmentalization in conjunction with 


e A subset of which can be specifically addressed by exact match to a
trans-acting RNA, either a small RNA or an RNA sequence within a
longer RNA. 

f Interestingly, the stability of genomic RNA-DNA hybrids in vivo is 
controlled by methylation of the RNA (see below).!” 

g There are other alternative and multi-stranded structures in eukaryotic
genomes, including Z-DNA (binding domains for which occur
in RNA editing enzymes, see Chapter 17), G-quadruplexes, I-motifs 
and cruciform structures, which, regrettably, despite the availability 
of specific antibodies, have not been mapped in genome-wide studies 
of genomic features and their dynamic relationship to cell type.208212 
Many of these alternative DNA structures are formed by simple 
sequence repeats, which also abound in the genomes of plants and 
animals.?? 
FIGURE 16.1
nucleus of a human monocytic leukemia cell visualized in situ
by an anti-triplex monoclonal antibody. The bar represents
5 μm. (Reproduced from Ohno et al.2 with permission of
Springer Nature.)
‘architectural’ TFs such as CTCF?” and the positions
of lncRNA:DNA triplex-forming sites have been shown
to be predictors for TADs.?? Triplex-forming oligonu- 
cleotides have been shown to alter cell division, inhibit 
tumor growth, stimulate recombination and modulate 
target gene expression.??42?7 

Many lncRNAs, including those expressed from
enhancers, have been shown to interact sequence-spe- 
cifically with DNA?” to regulate various processes 
through R-loop or triplex formation, including chromatin 
architecture, transcription, radiation response, cell pro- 
liferation, cell differentiation and organ development, in 
some cases (at least) intersecting with epigenetic path- 
ways.20+217229-240 Triplexes are also involved in small 
RNA-mediated transcriptional gene silencing 7^! 

A good example, from plants, is the lncRNA APOLO,
which coordinates the expression of multiple genes in 
response to cold through sequence complementarity and 
R-loop formation, decoys Polycomb and binds transcrip- 
tion factors at the promoter of a master regulator of root 
hair formation.???^' Amazingly, APOLO function can 
be partly mimicked by the sequence-unrelated lncRNA
UPAT, which interacts with orthologous proteins in
mammals, indicating conservation of regulatory structures
and lncRNA functions across kingdoms.?4*
The enhancer lncRNA KHPS1 forms a triplex with
enhancer DNA sequences to activate expression of the 
neighboring SPHK1 gene, by evicting CTCF, which insu- 
lates the enhancer from the SPHK1 promoter. Deletion of 
the triplex-forming sequence attenuates SPHK1 expression, 
leading to decreased cell migration and invasion, and the 
targeting of KHPS1 lncRNA can be switched by swapping
the triplex-forming promoter sequence to other genes.232233 

Other classes of ‘transcription factors’ such as Y-box
proteins also bind RNA, and known RNA-binding pro- 
teins such as hnRNP K (better known as a “splicing factor”) 
also act as transcription factors.2%%2% The ‘paired-box’ 
transcription factor Pax5, which is a “master regulator” of 
B-cell development by recruiting chromatin-remodeling, 
histone-modifying and basal transcription factor com- 
plexes to its target genes,?" is hijacked to the Epstein Barr 
Virus genome by a viral-encoded non-coding RNA.7% 
Another ‘transcription factor’, the nuclear hormone receptor
ESR1 (estrogen receptor α), which is commonly activated
in breast cancer, is also an RNA-binding protein.?^?

Dual RNA-DNA or ambiguous RNA/DNA- 
binding proteins also include p53,?25? the ‘guardian 
of the genome', possibly the most intensively studied 
gene and protein in human molecular biology, which 
binds a IncRNA (‘damage-induced noncoding RNA’, 
DINO).2! The dual DNA/RNA-binding protein TLS/ 
FUS (Translocated in LipoSarcoma/FUsed in Sarcoma) 
is allosterically regulated by lncRNA pncRNA-D.?252-254
Even RNA polymerase is regulated by RNAs. In mam- 
mals, RNA polymerase II is repressed by short RNA 
polymerase III transcripts derived from mouse B2 and 
human Alu repeat (SINE) elements.255-2% These elements 
also provide mobile RNA polymerase II promoters.??? 

Importantly, approximately half (~350) of the human
C2H2-ZF proteins, many of which are unique to primates, 
contain a KRAB transcriptional repression domain, which 
binds TEs,^ and evolved by recurrent TE capture that part- 
ners them with emergent TE-mediated regulatory net- 
works, influencing genomic imprinting, placental growth 
and brain development.!71:250-27 Most eukaryotic TFs also 
contain intrinsically disordered domains that overlap their 
DNA-binding domain and direct their target
specificity,127 likely by interaction with guide RNAs (see below).

Presumably, different types of DNA/RNA-binding 
proteins recognize different types of nucleic acid 


h KZFPs (KRAB domain-containing zinc finger proteins) control the
pleiotropic activation of TE-derived transcriptional cis-regulatory
sequences, some of which are primate-specific, during early embryo- 
genesis, in part through histone H3K9me3-dependent heterochroma- 
tin formation and DNA methylation.?9??9! Primate-specific KZFPs 
also regulate gene expression in neurons.?% 
structures and transact a different type of signal in different
contexts within the decisional systems that control
cell division and differentiation during development, as 
well as in physiological responses. The fact that most 
eukaryotic ‘transcription factors’ have confusing and
enigmatic functions attests to the likelihood that they 
have been interpreted in the wrong conceptual frame- 
work, with RNA the missing link.*” 


GUIDANCE OF DNA METHYLATION 


Transcriptional gene silencing in fungi and plants by 
RNA-directed DNA methylation was well established in 
the 1980s and 1990s (Chapter 12). These studies eventually 
showed that the enzymes that methylate DNA are directed 
to their sites of action by small RNAs interacting with the 
RNAi protein AGO4.277278 Small RNAs (miRNAs, siRNAs
and piRNAs) also induce site-specific DNA methylation
in animals,??5327?28? which again involves Argonaute
proteins,?81-285 suggesting that what had originated as an 
RNA-based mechanism for defense against viruses has 
been co-opted as a means of genome regulation. 

In 2004, Linda Jeffery and Sara Nakielny showed that 
the de novo DNA methylases Dnmt3a and Dnmt3b, but 
not the maintenance methylase Dnmt1, bind siRNAs with
high affinity.?*" Later, others reported that Dnmt1 (which
restores methylation at hemi-methylated CpG sites after
DNA replication) binds lncRNAs to alter DNA methylation
patterns at cognate loci.?552?!

Demethylation also appears to be an active process
guided by RNAs.???» RNA-directed DNA demethylation
has also been reported to involve R-loop formation,2%%%4
and recruitment of the TET2 dioxygenase/
demethylase (which unlike other TET enzymes does not 
contain a DNA-binding domain, but does bind RNA?29%%7) 
by RNAs transcribed from endogenous retroviruses.??* 

Some methyl-CpG-binding proteins, including
MeCP2, bind siRNAs and other RNAs through
a domain distinct from the methyl-CpG-binding
domain with, interestingly, RNA and methyl-CpG binding
being mutually exclusive,272% although there seem
to be variations in the regulatory mechanisms, including
RNA-mediated recruitment to phase-separated heterochromatin
compartments (see below). LncRNAs have
also been shown to link DNA methylation with his- 
tone modification through triplex formation with target 
sequences.?0 


! Transcriptional gene silencing can be induced by siRNAs in the 
absence of DNA methylation,?* indicating that small RNAs partici- 
pate in other pathways that control chromatin state and architecture. 
GUIDANCE OF HISTONE MODIFICATIONS


A range of histone variants and over 100 different his- 
tone modifications are differentially incorporated into 
nucleosomes located at millions of different positions in 
different cell types and different stages of development 
and differentiation (Chapter 14). However, like DNA
methylation enzymes, histone-modifying enzymes also 
have no intrinsic DNA-binding capacity or specificity, 
which is often assumed to be provided by sequence-spe- 
cific DNA-binding proteins or transcription factors that 
interact with them. On the other hand, like DNA meth- 
ylation enzymes, many histone modification writers and 
readers contain domains that bind RNA and/or contain 
RNA-binding modules. These include RNA recognition 
motifs,?! chromodomains,29%3023% bromodomains,2%
Tudor domains,*°* PRC2 subunits EZH2, EED, Suz12
and Jarid2,505-310 the H4K20 trimethylase Suv4-20h
and other histone-modifying complexes.*!!

RNA binding to histones was first reported in the mid-1960s2235
(Chapter 4). Around the turn of the century,
a number of groups showed that PcG (Polycomb group) 
proteins from C. elegans and vertebrates also bind RNA, 
and that this binding is essential for their chromatin local- 
ization and repression of homeotic genes.?^?!6 In 2005, 
Renato Paro and colleagues showed that the switch from 
the silenced to the activated state of a Polycomb response 
element in the Drosophila bithorax locus (Chapter 5) 
requires non-coding transcription.?" 

In 2007, John Rinn and colleagues showed that 
lncRNAs transcribed from human homeotic gene loci,
like those in Drosophila, are expressed along develop- 
mental axes and demarcate active and silent chromosomal 
domains that have different H3K27me3 profiles and RNA 
polymerase accessibility, the exemplar of which, HOTAIR 
(Chapter 13), interacts with PRC2 and is required for 
PRC2 occupancy and histone H3K27 trimethylation at 
the HOXD locus.*'® Other studies showed, for example, 
that retinoic acid-induced expression of lncRNAs follows
the collinear activation state and correlates with loss of 
Polycomb repression at the HOXA locus.?? 

In 2008, the groups of Chandrasekhar Kanduri and 
Peter Fraser showed that lncRNAs differentially expressed
from parentally imprinted loci, specifically the 105 kb Air
RNA and the 91 kb Kcnq1ot1 RNA (formerly KvLQT1-AS,
Chapter 9), and later others, also bind PRC2 to repress the
relevant alleles.??3? In the same year, we showed that
lncRNAs expressed from the antisense strand of homeotic
gene loci during embryonic stem cell differentiation are
associated with both the chromatin-activating Trithorax
MLL1 complex and activated chromatin containing
H3K4me3 marks.?? Subsequently, other lncRNAs (including
enhancer RNAs and small RNAs derived from
them) were also shown to associate with Trithorax complexes,
including RNAs involved in maintenance of stem
cell fates and lineage specification (such as Evx1as and
HOTTIP),1+45524-334 and even grain yield in rice.??

In 2009, Rinn and colleagues surveyed over 3,300 
lncRNAs and showed that ~20% (but only ~2% of
mRNAs) interact with PRC2, and that others are bound 
by other chromatin-modifying complexes.* Moreover, 
knocking down a selection of these RNAs caused dere- 
pression of genes normally silenced by PRC2.3 Over 
9,000 RNAs bind PRC2 in embryonic stem cells,*°° with 
many individual cases, including the lncRNAs H19,
MEG3, ANRIL and HOTAIR, subsequently character- 
ized in some detail.303.309.337-342 

RNA has also been shown to be required for PRC2 
chromatin occupancy, PRC2 function and cell state defini- 
tion.?9 Short RNAs transcribed from Polycomb-repressed 
loci resemble PRC2-binding sites in Xist, and interact 
with PRC2 through its subunit Suz12.?" PRC2 binds 
G-quadruplex structures in RNA,*4 which inhibit PRC2 
activity and are antagonized by allosteric activation of 
PRC2 by H3K27me3 and regulators of histone methyltrans- 
ferases,*4546 indicating complex decisional transactions. 

PRC1 function also appears to be controlled by
RNA? and PRC1 resides in membrane-less phase-separated
nuclear organelles?^ that are likely to be RNA-nucleated
(see below).

PRC2* binds many RNAs 'promiscuously',?^ a
description! that does not mean 'non-specifically'.339349
The association of Polycomb and Trithorax complexes 
with many RNAs, likely through orthologous domains 
(see below), is consistent with these RNAs functioning as guide
molecules for RNA-directed site-specific histone and 
DNA modifications. It is also consistent with the fact 
that Polycomb and Trithorax proteins are involved in 
many differentiation and developmental decisions, from 
cell cycle regulation to embryogenesis and body plan 
specification,5955 so their binding to many different 
guide RNAs would be expected. Again, these include 
RNA-DNA, RNA-RNA and RNA-protein interactions, 
recruitment or eviction of histone modifiers and other 


| ANRIL also binds PRC1.

* For many reasons, PRC2 has been the most intensively studied of 
all of the histone-modifying complexes with respect to the role of 
lncRNAs in epigenetic regulation of gene expression.??

The presence of intrinsically disordered domains in most proteins 
involved in development has also been described as conferring pro- 
miscuity, a functional trait that allows flexible interactions in regula- 
tory networks (see below). 
FIGURE 16.2 The organization of the Xist locus. The Xist, Tsix, Jpx, Xite, Tsx and Ftx genes specify lncRNAs. (Reproduced
from Loda and Heard?? under Creative Commons attribution license.) 


chromatin-modifying proteins, alteration of DNA topol- 
ogy, allosteric inhibition, and reorganization of phase- 
separated domains. ?+2.343,345,346,356-364 

LncRNAs also control switching between Polycomb 
and Trithorax response elements.2%356 Other histone 
modifications are also regulated by lncRNAs," including
during memory formation.*% An intriguing observation, 
not inconsistent with RNA involvement, is that histone- 
modifying enzymes, rather than the parental histones, 
may remain associated with DNA through replication 
to re-establish the epigenetic information on the newly 
assembled chromatin.*%” 


XIST AS THE EXEMPLAR 


While initially thought to be a special case, the best 
characterized and most illustrative example of the complex
interplay between lncRNAs, chromatin structure
and gene expression is Xist.30836% Xist has eight exons
and is 17 kb in length.*% It has a highly modular structure,
including a number of types of conserved ‘repeat’
sequences, and interacts with over 80 different proteins 
including cohesin, Polycomb and other chromatin remod- 
elers, “037 at low copy number.*>3”° 

Xist-mediated silencing of the inactive X chromo- 
some in mammals requires its repeat sequences??? 


™LncRNAs also control the methylation of a number of non-his- 
tone proteins involved in cell signaling, gene expression and RNA 
processing. 


and involves Polycomb recruitment, deacetylation of 
H3K27ac and H3K27 methylation on the silenced chro- 
mosome.?05368582357 Spreading involves the partitioning 
of chromatin topology*883% by the formation of phase-separated
domains*7%5%3% and the interaction of RNA-binding
proteins with repetitive elements (in particular
LINE-1 elements) in the X chromosome to recruit silenc- 
ing mechanisms targeted to repeats,+7239%-3% as first pro- 
posed by Mary Lyon,*??^'! likely via triplex formation.*% 
Xist also acts as a suppressor of hematological cancers? 
and is essential to maintain X-inactivation of immune 
genes, dysregulated in females suffering systemic lupus 
erythematosus or COVID-19 infection.?5? 

Xist expression and action are controlled and effected
by other lncRNAs* that are expressed antisense to Xist
(Tsix, which blocks RepA RNA binding to PRC2?0404405)
or from adjacent loci on the active or inactive X chromosomes+%-4%
(Figure 16.2). The Jpx lncRNA, whose
gene resides ~10 kb upstream of Xist, activates Xist by
regulating CTCF anchor site selection to alter the topography
of chromosome loops.?^*09^? A primate-specific
TE-derived lncRNA, XACT, coats active X chromosomes
in pluripotent cells and is connected into the pluripotency
regulatory network in humans by a primate-specific retroviral
enhancer.?! Another lncRNA expressed from the
X-inactivation center, Tsx, functions in germ and stem
cell development as well as in learning and behavior.*!?
The lncRNA Firre, which also contains a number of
repeats, one of which interacts with the nuclear matrix
factor hnRNPU, anchors the inactive X chromosome
near the nucleolus and is required for the maintenance of
its repressive H3K27me3 marks.^? Firre is also required 
for the topological organization of other chromosomal 
regions,41+415 and is involved in other developmental pro- 
cesses including adipogenesis and hematopoiesis.^!ó 

X-inactivation also involves the RNAi enzyme 
Dicer?652745 and methylation of Xist transcripts,^? 
implicating small RNAs, the RNA interference pathway 
and RNA modifications in a complex set of decisional 
pathways that control chromatin architecture from yeast 
to humans.??.?! Chromosomal dosage compensation in 
Drosophila, which involves global activation of the single
X chromosome in males, is controlled by the lncRNAs
roX1 and roX2,*? via a conserved predicted stem-loop
structure required for histone H4K16 acetylation of the X
chromosome? and selective X-chromosome subnuclear 
compartmentalization.+4 


ENHANCER RNAs AND 
CHROMATIN STRUCTURE 


As discussed in Chapter 14, enhancers play a key role in 
specifying cell identity and are a signature feature of the 
regulation of gene expression during development. 

Enhancers were initially identified by their activity, 
rather than their physical manifestation, but have been 
interpreted in terms of the initial speculations about the 
latter, postulated and promulgated by Mark Ptashne (and 
widely accepted) to comprise clusters of binding sites for
TFs that act at a distance by ‘looping’ to make contact 
with the promoters of target genes, some of which can be 
located hundreds of kilobases distant.*25-1 

However, early studies had shown that lncRNAs
are transcribed from enhancer regions in well-stud- 
ied loci,4?-434 with supporting evidence accumulating 
with improving transcriptomic and chromatin analysis
technologies. Although sometimes referred to as
(protein-coding) ‘gene deserts’,? enhancers exhibit the
characteristics of bona fide genes, including nucleosome- 
depleted promoter regions that bind transcription fac- 
tors and the transcription of adjacent sequences.?6-^4! 
Indeed, the epigenetic architecture of, and the features 
of transcription initiation at, the promoters of conven- 
tional protein-coding genes and enhancers are almost 
indistinguishable.46441-445 

Enhancers and 'super-enhancers' are transcribed to 
produce non-coding RNAs specifically in the cells in 
which they are active317:435:437.441-447 and their expression 
is considered the best molecular indicator of enhancer 
activity in developmental processes#7443448-453 and 
cancers.91.454-457 

Enhancers recruit RNA polymerase?? and produce
short unstable bidirectional transcripts (‘eRNAs’)
from their promoters,+7:444,459-462 as do protein-coding 
genes, 90465-4655 and it is uncertain whether these tran- 
scripts play a role in enhancer action or simply mark 
active promoters and/or reflect promiscuous RNA poly- 
merase initiation at accessible chromatin.+41461,464466-468 
On the other hand, enhancers also express multi-exonic 
lncRNAs, the half-life of which is exosome regulated, *66
and many if not most lncRNAs likely derive from
enhancers. 441444446469-79 Enhancers with tissue-specific activity
are enriched in introns, suggesting that “the genomic
location of active enhancers is key for the tissue-specific
control of gene expression”. 490

There is good evidence that enhancers regulate chro- 
mosome looping and local chromatin reorganization to 
alter cell fate, 9544145148148? which is consistent with but 
does not demonstrate direct contact between enhancer 
TF-binding sites and the promoters of target genes. It is 


FIGURE 16.3 Cloud formed by the enhancer lncRNA Evf2 and its localization to activated (Umad1, 1.6 Mb distant) and
repressed (Akr1b8, 27 Mb distant) target protein-coding genes. (Reproduced from Cajigas et al.^? with permission
of Elsevier.)
also consistent with enhancer RNAs organizing looping
with target genes, which has been demonstrated in
at least two cases,*82483 and/or the formation of topo- 
logically associated, possibly phase-separated, chroma- 
tin domains!? as local hubs of transcription regulation 
(Figure 16.3). Recent studies suggest that there is no 
direct contact between TFs bound at the enhancer and 
the promoter of genes regulated by enhancer action,*** 
and that maintenance of enhancer-promoter interactions 
and activation of transcription are separable events.*5 

At the heart of the debates about the mechanism of 
enhancer action has been the question of whether the 
RNAs transcribed from active enhancers are simply a 
passive by-product of TF occupancy, or whether it is the 
‘act of enhancer transcription’ or the enhancer RNAs
themselves that mediate enhancer action.^42447486487 The 
evidence for these possibilities, which are not mutually 
exclusive, has been widely canvassed and variously inter- 
preted, with the former initially favored because it fitted 
the TF paradigm of protein-coding gene regulation and 
did not require acceptance of large numbers of regula- 
tory RNAs.*4!476.487-489 In line with these preconceptions, 
some studies have reported that the transcribed enhancer 
RNA sequences can be partly/substantially (it is difficult 
to be sure") deleted, truncated or replaced with no obvi- 
ous effect (see, e.g., 499491). 

Other studies showed that enhancer RNAs are required 
for enhancer activity.92:473479482492-498 For example, deletion
in mice of the multi-exonic lncRNA Maenli, which is
expressed from an enhancer that controls limb development
and is deleted in a human developmental disorder, recapitulates
the human phenotype.*” Deletion of internal exons
of the enhancer-derived lncRNA ThymoD blocks T-cell
development and causes developmental malignancies.**?
Truncation of the lncRNA Evf2, which is transcribed
from the highly conserved Dlx5/6 enhancer that spatially
organizes the expression of a 27 Mb region on chr6 during 
mouse forebrain development, abrogates the action of the 
enhancer.“ The latter study also showed that the 5' and 
3' ends of the Evf2 enhancer RNA had different functions,° 
and that Evf2 formed an “RNA cloud” encompassing its
target genes.^? It has also been shown that methylation and 
splicing of enhancer RNAs are required for enhancer func- 
tion and chromatin organization.490;500-502 


n Due to incomplete characterization of the enhancer RNA/transcrip- 
tion unit. 

e Evf2, which has a human homolog, also interacts with Sox2 to alter 
its target specificity, regulates transcription of the homeodomain 
transcription factors Dlx5 and Dlx6 as well as cohesin binding,
and influences chromatin remodeling in the formation of GABA- 
dependent neuronal circuitry. 2245473499 
siRNA-mediated knockdown of enhancer RNA also
abrogates or reduces enhancer action, demonstrating the
involvement of the RNA.^750507 RNA has been shown
to be required for the formation of enhancer-target promoter
contacts by the transcription factor YY1 (which
itself regulates the expression of many lncRNAs*%%8) and
ectopic expression of enhancer RNAs upregulates expres- 
sion of the genes normally targeted by the enhancer.*72.509 

Careful analysis shows that the variable phenotypic 
consequences of enhancer lncRNA knockdown, truncation
or ablation depend on the details.^65!05!! For example,
a short deletion of the promoter and first two exons of
the 17 kb lncRNA Hand2os1, which is expressed from an
enhancer essential for heart morphogenesis, did not pro- 
duce discernable heart phenotypes, but deletion of exons 4 
and 5 caused severe contraction defects in adult heart that 
worsened with age, and deletion of the entire Hand2os1 
sequence led to dysregulated cardiac gene expression, 
septum lesion, heart hypoplasia and perinatal death.5!0 

It has been shown that enhancer RNAs produced in 
response to immune signaling bind the bromodomains of 
BRD4 (and other epigenetic reader bromodomain-contain- 
ing proteins) to augment BRD4 enhancer recruitment and 
transcriptional cofactor activity.?? BRD4 also cooperates 
with oncogenic fusions of MLL1 to induce transcriptional 
activation of enhancer RNAs, one of which has been shown 
to bind histone H4K31ac to promote histone recognition
and oncogene transcription.? Other mechanisms may
involve the interplay between different types of regulatory
RNAs and chromatin-associated proteins, as suggested by
interactions between NEAT1 and BRD4/WDR5? complexes,
and enhancer RNAs with cohesin, with effects on
specific target genes.^^ Finally, there is also evidence of 
concerted action of cis-acting enhancer RNAs with other 
transcripts that have trans-acting roles.5!4-516 

While there may not yet be universal acceptance, the 
evidence is accumulating that enhancer RNAs are inte- 
gral to enhancer function,*” and that enhancer RNAs are 
simply a (large) class of lncRNAs that regulate chromatin
architecture and the expression of protein-coding genes and
(other) lncRNAs, albeit through physical mechanisms that
are not yet well understood, but involve recognition of 
effector proteins and sequence-specific RNA-DNA con- 
tacts via R-loops or triplexes,!^?212365/5 and formation of 
topologically associated domains to form developmental 
stage-specific transcriptional hubs.*91.519.520 It is also evident 
that transcription itself modulates chromosome topology 
and phase transition-driven nuclear body assembly.*21324 


P WDR5 is a core subunit of the human MLL1-4 histone H3K4 methyltransferase
complexes.
There are ~400,000 enhancers (and ~400,000 differentially
accessible chromatin elements, which
likely correspond to promoters) in the human
genome.48,441,444,448,456.460,523-333 This is similar to the number of
lncRNAs expressed from the human genome (Chapter 13).
Indeed, apart from the fact that they do not encode proteins,4 
enhancers might be properly viewed as genes, which, 
together with the multitude of other genes expressing func- 
tional non-coding RNAs, resolves the G-value enigma. 

mRNAs may also have enhancer function,!24% which 
would not be surprising given the interwoven nature of 
the expression of genetic information during the complex 
ontogenies that underpin animal and, to a lesser extent, 
plant development. 


RNA SCAFFOLDING OF PHASE- 
SEPARATED DOMAINS 


It has been known for many years that there are many 
ribonucleoprotein (RNP) complexes that exist in defined 
territories in the nucleus and cytoplasm of eukaryotic 
cells, prominent examples of which include nucleoli, spli- 
ceosomes, paraspeckles and stress granules,??7? none of 
which are membrane-bound. One of the most important 
advances of recent years, first canvassed by Harry Walter 
and Donald Brooks in 1995, and later demonstrated by 
Clifford Brangwynne, Anthony Hyman and colleagues, 
is that these and other focal or ‘punctate’ organelles are 
phase-separated condensates or 'coacervates'?405!! that
compartmentalize biochemical and regulatory hubs,5*? 
although not without some controversy and uncertainty.^? 

Phase-separated condensates, which are heteroge- 
neous in constitution and properties, are commonly 
referred to as ‘phase-separated domains’ (PSDs). They
are also called ‘liquid crystal domains’, ‘liquid droplets’, 
‘biomolecular condensates’, ‘nuclear clouds’ or ‘nuclear 
bodies’, and exist in an aqueous state distinct from the
surrounding environment, the biological manifestation of 
liquid or soft matter physics.?^ 

In vitro PSDs form spontaneously by association of 
oppositely charged molecules such as negatively charged 
RNA interacting with positively charged proteins.?^^ In 
vivo they are formed by interactions between RNAs," 
RNA-binding proteins and proteins with intrinsically dis- 
ordered domains (IDRs),2753954254454699 explaining the 
latter's previously mysterious function.??? 


New mechanisms that control the expression and function of RNAs 
expressed from promoters of protein-coding genes and enhancers are 
still being identified, affecting splicing, elongation, termination and 
processing/half-life.166534536

" ‘RNA regulates the formation, identity, and localization of phase-separated
granules’.54
IDRs lack rigid tertiary structure and are characterized
by a high proportion of small, polar and positively
charged amino acids (arginine, histidine and lysine),
often in the form of RGG/RG, histidine-rich domains
or other repeats.55%-5% IDRs are promiscuous, i.e., they
interact with and are tunable by many partners.550556-56!
Intrinsically disordered RGG/RG domains mediate specificity
in RNA binding,*2563 and IDRs flank the DNA-binding
domains of transcription factors, where they direct TF
binding and, inter alia, the temporal regulation of transcription
complexes that specify neuronal subtypes.55
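
As a rough illustration of the compositional features just described, the following Python sketch computes the fraction of positively charged residues (arginine, histidine and lysine) in a protein sequence and counts its RGG and RG motifs. The function names and the example sequence are hypothetical, chosen only for illustration; composition alone does not establish that a region is intrinsically disordered, and dedicated prediction tools consider many more features.

```python
import re

# One-letter codes for the positively charged amino acids named in the text:
# arginine (R), histidine (H) and lysine (K).
POSITIVE = set("RHK")


def positive_fraction(seq: str) -> float:
    """Return the fraction of residues in seq that are R, H or K."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for aa in seq if aa in POSITIVE) / len(seq)


def count_rg_motifs(seq: str) -> dict:
    """Count RGG and RG motifs; lookaheads allow overlapping matches."""
    seq = seq.upper()
    return {
        "RGG": len(re.findall(r"(?=RGG)", seq)),
        "RG": len(re.findall(r"(?=RG)", seq)),
    }


# Hypothetical toy sequence, not taken from any real protein.
example = "RGGRGGKH"
print(positive_fraction(example))  # 0.5 (R, R, K and H out of 8 residues)
print(count_rg_motifs(example))    # {'RGG': 2, 'RG': 2}
```

Every RGG motif is also counted as an RG motif here, so the two counts are not mutually exclusive.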

IDRs and PSDs occur in bacteria and archaea,565-568 
but there have been sharp increases in the fraction of 
the proteome containing IDRs between prokaryotes, 
simple eukaryotes and multicellular organisms, and 
the number of proteins containing IDRs correlates 
with the number of cell types, suggesting co-evolution 
of IDR-mediated transactions with developmental 
complexity.55%560 

IDRs are usually located at the N- or C-terminal region 
of the protein. IDRs are present in and essential for the 
function of nearly all of the proteins involved in animal 
and plant development, including RNA polymerase, most 
transcription factors, Hox proteins, histones, histone- 
modifying proteins, other chromatin-binding proteins, 
the Mediator complex, RNA-binding proteins, splicing 
factors, membrane receptors, cytoskeletal proteins and 
nuclear hormone receptors.?71,272,274,539,551,560,563,570-577 

Surprisingly, the majority of proteins subject to alter- 
native splicing contain IDRs.5% Moreover, IDRs are 
overrepresented in alternatively spliced exons subject to 
tissue- and lineage-specific regulation,?75*! especially in 
exons that are alternatively spliced in mammals but con- 
stitutively spliced in other vertebrates,>*? which changes 
the subcellular localization of the isoform and coordinates
phase transitions within the cell.553584

IDRs also occur in proteins that are flexibly involved
in signal transduction and transport,% such as the Ras-GTPase-activating
protein (SH3 domain)-binding proteins
G3BP (which form phase-separated domains),
proteins of clathrin-mediated endocytosis?$? and synapsins, which
are required for the maintenance of synaptic vesicle clusters
in neurons by IDR-mediated phase separation.^*7

IDRs are major sites of post-translational modifica- 
tions and many biological processes, including the regu- 
lation of the cell cycle and circadian clocks,?? have been 
shown to be dependent on post-translational modification 
of IDRs.5%9585-5% Post-translational modifications modulate
RNA binding? and alter the propensity to nucleate


s A much higher percentage than found in the rest of the proteome.2725% 
PSDs,??? adding layers of complexity to their interactions
and regulation.?39557,593,594 

The known post-translational modifications not only 
include the 100 or so found in histones, but also 95 in 
the IDR of the axonal microtubule-associated protein
Tau, which is involved in Alzheimer’s and other neurode- 
generative diseases.?? Tandem RNA-binding sites in the 
RNA-binding protein, TIA-1, facilitate PSD stress gran- 
ule formation” and reduction of this protein protects 
against Tau-mediated neurodegeneration.*?’ 

Other proteins involved in neurological functions and 
disorders, such as TDP-43, ataxin, c9orf72 and FMRP 
(fragile X mental retardation protein), also contain IDRs 
that are involved in phase separation, controlled in part 
by post-translational modifications.5%5% For example, 
the IDR of TDP-43 binds RNAs,% and the loss of its 
RNA-binding ability by mutations or post-translational 
acetylation leads to its sequestration into PSDs.%! The 
IDRs of ataxin mediate formation of neuronal mRNP 
assemblies, and are essential for long-term memory for- 
mation as well as c9orf72-induced neurodegeneration.9?? 
Neuronal-specific micro-exons overlapping IDRs in the 
translation initiation factor eIF4G regulate the coalescence
of phase-separated granules to repress translation,
are misregulated in autism, and their deletion in mice 
leads to altered hippocampal synaptic plasticity and defi- 
cits in social behavior, learning, and memory.9? 

Aberrant promiscuity of IDR-containing proteins 
(IDPs) and perturbations of PSD formation may under- 
lie the dosage sensitivity of oncogenes and other pro- 
teins60+60%5 as well as neurodegenerative disorders 
such as Alzheimer's Disease, Parkinson's Disease, 
Frontotemporal Dementia, Muscular Dystrophy and 
Amyotrophic Lateral Sclerosis, where repeat expansions 
affecting RNA and/or their encoded proteins result in 
pathological aggregates.595:606.607 

Mutations in RBPs that cause human monogenic dis- 
eases are observed more commonly in IDRs than globular 
domains,?56905 indicating that, despite their relatively sim- 
ple composition, IDRs have strong sequence constraints. 
It is also clear that RNA nucleates the formation, and is 
the structural scaffold, of PSDs.27542.547549,551,553,609-614 

PSDs encompass a range of nuclear compartments:27615 
DNA replication initiation sites;? telomeres;?" centro- 
somes? and meiotic chromosomal pairing foci;619620 
germ granules;?^9! nucleoli, Cajal bodies and ‘histone
locus bodies’,4962-9^ ‘extranucleolar droplets’;??
spliceosomes (‘nuclear speckles’);?5 specialized spliceosomes
(via lncRNAs Gomafu" and Malat1);%27-62 para-


t p53 target gene association with nuclear speckles is driven by p53.9? 
« Japanese for ‘spotted pattern’. 
speckles’ (Neat1);930-6% heterochromatin; Polycomb
bodies; primate-specific nuclear stress bodies;6?6957
nuclear glucocorticoid receptor foci;*38 SARS-CoV-2 viral
assembly domains;** and others, including in plants.9^?
They also include cytoplasmic organelles?764.92 such
as P-granules,9^594^^ G-bodies,9^ stress granules,% polar
bodies (whose formation is dependent on a lncRNA),947648
localized mRNP translational assemblies®?° and syn- 
aptic compartments.6! 

It has been proposed that lncRNAs play a central role
in organizing the three-dimensional genome,?! includ- 
ing the formation of spatial compartments and transcrip- 
tional condensates610;614652-65 (Figure 16.4) and hence 
the four-dimensional patterns of gene expression dur- 
ing differentiation and development." It has been shown 
that phase separation drives chromatin looping9? and is 
required for the action of enhancers and super-enhanc- 
ers;551610658-660 that transcription factors activate genes 
through the phase-separation capacity of their activation 
domains by forming PSDs with RNA polymerase II;%61,662
that Mediator and RNA polymerase II associate in tran- 
scription-dependent condensates;658661-663 that phase 
separation of RNA-binding protein promotes polymerase 
binding and transcription;** and that PSDs scaffolded by 
lncRNAs, including repeat-derived RNAs, mediate heterochromatin
formation,32:609614665-669 euchromatin formation,9?
nucleolar structure,7!-6% splicing? and DNA
damage repair.576-678 

For example, it has been shown that the cytoplasmic 
lncRNA NORAD, which is induced by DNA damage and
required for genome stability, prevents aberrant mito- 
sis by sequestering Pumilio proteins (which bind many 
RNAs to regulate stem cell fate, development and neuro- 
logical functions$*?) into PSDs via multiple repeats.681-6854 
Lack of NORAD accelerates aging in mice. Similarly,
sequestration of the double-strand break enzyme RAG1
into nucleoli modulates V(D)J recombination activity.586
Many natural antisense lncRNAs with embedded mammalian
interspersed repeats are overrepresented at loci
linked to neurodegeneration and/or encoding IDPs.55? 

The PARP1 superfamily,* one of the most abundant
proteins in the eukaryotic nucleus,955-9? which
catalyzes the polymerization of ADP-ribose units and 
attachment of poly (ADP-ribose) polymers to arginines 


Y Involving RNA-protein and RNA-RNA interactions.'!3,630,631 

* PSDs may also serve to reduce noise in biological signal processing 
and control. 

* There are 18 members of the PARP superfamily encoded in the
human genome. PARP2 is also involved in chromatin modification,
PARP3 is a core component of centrosomes, and PARP4 is
associated with vault particles (Chapter 8).988

FIGURE 16.4 Subcellular and subnuclear localization of RNAs in punctate domains. (Reproduced from Cabili
et al.5 with the permission of the authors under Creative Commons attribution license.)

FIGURE 16.5 RNA promotes the formation of spatial compartments in the nucleus. (a) A 3D space filling nuclear structure
model of selected lncRNAs. (b) A 3D space filling nuclear structure model of 543 lncRNAs that display at least 50-fold enrichment
in the nucleus. Each sphere corresponds to a 1 Mb region or larger where each lncRNA is enriched. (Reproduced from Quinodoz
et al.5! with permission of Elsevier.)


in target proteins," including histones, for DNA repair,
stabilization of replication forks and the modification of
chromatin, also binds lncRNAs,9??-9 regulates RNA
metabolism? and modulates the phase-separation
properties of RNA-binding proteins.695.699

Many lncRNAs appear to be localized to defined
nuclear and cytoplasmic foci that resemble liquid
droplets,336,414,473,679,700-702 and a genome-wide study identified
hundreds of non-coding RNAs forming nuclear compartments
near their transcriptional loci, in dozens of cases
guiding cooperating proteins into these 3D compartments
and regulating the expression of genes contained
within them 225-614 (Figure 16.5).

X-chromosome dosage compensation in Drosophila
requires the formation of a phase-separated coacervate
by the lncRNAs roX1 and roX2 interacting with the IDR
of a specific partner protein (MSL2, ‘male sex lethal 2’).
Moreover, replacing the IDR of the mammalian ortholog
of MSL2 with that from Drosophila, along with expression
of roX2, is sufficient to nucleate ectopic dosage
compensation in mammalian cells, showing that the roX-MSL2
IDR interaction is the primary determinant for
compartmentalization of the X chromosome, and a likely
exemplar of lncRNA-IDR interactions in general.?^

As further evidence that the eukaryotic nucleus (and 
indeed the eukaryotic cytoplasm) is finely organized, and 
that many more phase-separated domains remain to be 
discovered, a recent report has shown that two related 
RNA modification enzymes that normally reside (in one 


Y Interestingly, a similar enzyme in bacteriophage has been reported
to add entire RNA chains to a host ribosomal protein to modulate the
phage replication cycle.9?!


case) in the nucleolus and (in the other) in an unknown 
cytoplasmic domain proximal to mitochondria, both 
relocate upon nerve cell depolarization to different small 
unknown punctate nuclear domains.’ 

The complexity is extraordinary and the literature on this
topic is burgeoning, but it is now clear that PSDs comprise
a major and until recently unappreciated fine-scale and
dynamic spatial regulation of subcellular and chromatin
organization, “the active chromatin hub”, first proposed
by Wouter de Laat and Frank Grosveld in 2003 based on
the study of globin enhancers,’* which “unifies the roles
of active promoters and enhancers”.*% It has also been
proposed, with experimental support, that
“ribonucleoprotein complexes can act as block copolymers to form
RNA-scaffolding biomolecular condensates with optimal
sizes and structures in cells”.34


AN ADDITION TO THE ANCIENT 
RNA WORLD HYPOTHESIS 


The ability of RNA to nucleate phase-separated domains 
adds a third dimension to its role in the origin of life.” 
While it has been widely accepted that RNA was likely 
the primordial informational and catalytic molecule of 
life, its advent would also have enabled the formation 
of a pre-cellular phase-separated privileged environ- 
ment wherein organic reactions could be concentrated 
and evolve. Indeed, compartmentalized RNA catalysis 
has been demonstrated in membrane-free coacervate 
protocells.706.707 

The development of RNA-nucleated coacervates likely
involved the interaction of RNA with positively charged
(particularly arginine-rich) disordered proteins.”% Intrinsically
disordered proteins are encoded by the most ancient
codons and appear to be the first polypeptides, likely to
have functioned initially as chaperones, with catalysis
transferred first from RNA to ribonucleoprotein
complexes and then to proteins,%4355557 a process that may
have been interactive.”

FIGURE 16.6 The modular domain structure and interactions of lncRNAs. (Reproduced from Mercer and Mattick.7^)


STRUCTURE-FUNCTION 
RELATIONSHIPS IN LNCRNAS 


The length of lncRNAs varies enormously, although in
many cases their true length and structure are unknown
due to their cell-type specificity and low representation
in RNA sequencing datasets. However, high depth
sequencing has shown that most are multi-exonic,”' and
some are over 100 kb in length (post splicing), so-called
macroRNAs, which have a mean length of 92 kb and are
predominantly localized in the nucleus.”!

Why are lncRNAs so long? The likely answer is that
they contain a set of modular domains for binding
proteins and guiding them to target sequences in DNA or
(other) RNAs?*7?-74 (Figure 16.6).

First, although rapidly evolving (under relaxed
structure-function constraints and positive selection for
adaptive radiation), lncRNAs exhibit common motifs and
motif combinations across vertebrates,/'? and at least 18%
of the human genome is conserved at the level of
predicted RNA structure.7!€ For example, it has been shown
that conserved pseudoknots in the lncRNA MEG3 are
essential for stimulation of the p53 pathway."

Second, similar and potentially paralogous predicted 
RNA structures occur at many places throughout the 
genome.718-720 

Third, lncRNAs are enriched for repeat sequences,
which have highly non-random distributions in
them.?.68:7! A notable feature of Xist, for example, is
that its most highly conserved sequences are the repeat
elements, whereas its unique sequences have evolved
rapidly,’? and many of its biological functions, including
PRC2 binding, are mediated through its modular repeat
elements.?71.379.380,387,397,723

Many of the lncRNAs referred to earlier have also
been shown to be modular, with common features being
rapid sequence evolution and structural divergence while
retaining related functions, sometimes across large
evolutionary distances, and the use of TE-derived sequences
as protein-binding domains.4!0721724,725

Indeed, TE-derived sequences and tandem repeats
participate in many RNA-protein interactions,?^67»726777 which
leads to the reasonable conclusion that repeat sequences
act as RNA-, DNA-, and protein-binding domains that
are the essential components of lncRNA function,?! and
that TEs are key building blocks of lncRNAs?”72% (as
well as fulfilling many other modular functions in gene
control and gene expression; Chapter 10). Transposition
is an efficient means of mobilizing functional cassettes?”
and allowing evolution to explore phenotypic space by
modulation of the epigenetic control of developmental
trajectories. As pointed out by Neil Brockdorff, “tandem
repeat amplification has been exploited to allow orthodox
RBPs [RNA binding proteins] to confer new functions for
Xist-mediated chromosome inactivation ... with potential
generality of tandem repeat expansion in the evolution of
functional long non-coding RNAs”.???

Fourth, many lncRNAs, including well-studied cases such
as Xist, roX and HOTAIR, bind chromatin-modifying
proteins, transcription factors, nuclear matrix proteins
and RNA-binding proteins to exert functional
consequences.302311,342,371727 It has also been reported that
an mRNA can act as a scaffold to assemble adaptor
protein assemblies to regulate intracellular transport.’”°

Fifth, chemical probing has shown that lncRNAs, including
Xist, physically have a modular structure,?71 3747507731
and the chemical data match the structures predicted by
evolutionary conservation of secondary structure, validating both.?”*

Finally, the extensive alternative splicing of lncRNAs
strongly implies a modular structure”!07%-755 and alternative
splicing has, unsurprisingly, been shown to alter the
function of lncRNAs.?95738




If the complex ontogeny of a human requires a large
number of guide RNAs then it is not surprising that many
have similar protein-binding modules, with variation
in repertoire and genomic target specificity, which may
only require short stretches of nucleotide complementarity,
given the high strength of RNA-RNA and RNA-DNA
interactions.” These modules may also include enzymes
that have adjunct roles in epigenetic transactions: for
example, the developmentally regulated lncRNA H19
binds to and inhibits S-adenosylhomocysteine hydrolase,
the enzyme that hydrolyzes S-adenosylhomocysteine,
a feedback inhibitor of DNA methyltransferases.”°
The alternative splicing of lncRNA exons (which itself
must be epigenetically controlled?) permits selection
of specific protein-binding modules and target sites for
feed-forward control of protein and regulatory RNA gene
expression at many different loci and ultimately cell fate
(hold, divide or differentiate) decisions at every stage of
developmental ontogeny. There is no cogent model for
such fine control by proteins alone.

The challenge now is to determine the repertoire of
RNA structures using RNA folding programs, evolutionary
conservation, physical and chemical analyses, and
machine learning,?74716,718-720,742-74 and then to
determine which RNA structures bind which proteins
or which DNA or RNA targets, with a range of techniques
becoming available to map RNA localization
and interactions, !33715!8.614.748-755 so that lncRNA biology
can be parsed and understood, and thereby to construct
an expanded Rfam (‘RNA family’) database,^9 like the
Pfam protein domain database" that has proved so useful
in identifying protein function.


A NEW VIEW OF THE GENOME 
OF COMPLEX ORGANISMS 


It is increasingly evident that lncRNAs, enhancers,
topologically associated chromatin domains, transpo- 
son-derived sequences and other repeats, chromatin- 
remodeling and epigenetic information are merging into 
the same conceptual and mechanistic space. 

Imagine the versatility and temporal precision that 
could be achieved if the locus specificity or local acces- 
sibility of proteins that control gene expression during 
development is guided by stage-specific modular RNAs. 
The strength of RNA is its potential to address targets 
through sequence-specific duplex or triplex base-pairing 
while at the same time recruiting and directing effector 
proteins to specific genomic locations. 

We propose that regulatory RNAs, including the
‘repeat’ sequences within them, are the evolutionarily
and developmentally flexible platforms expressed from
the genome in an unfolding symphony to direct and
execute the extraordinarily complex decisions required
for the precise ontogeny of trillions of differentiated
cells in a human, and similarly in other mammals,
vertebrates, invertebrates and plants. That is, we propose
that the epigenetic marks that control development,
physiological adaptations and brain function are positioned
and controlled by RNAs (among their many other functions
in cell biology and gene regulation), and that the
proportion of the genome devoted to specifying regulatory
and architectural RNAs increases with developmental and
cognitive complexity.75875

Small RNAs (miRNAs, sgRNAs, etc.) are simple 
sequence-specific guides for a single type of effector, such 
as RISC or CRISPR/Cas. LncRNAs not only have target 
sequence specificity but are also scaffolds for a range 
of proteins, notably chromatin-modifying complexes, 
with both targets and cargoes being four-dimensionally 
regulated by alternative splicing of IncRNA exons in a 
feed-forward cascade that directs the next cell fate deci- 
sion during development. This is a highly efficient system 
that, like RNAi and CRISPR but in a far more sophisti- 
cated and modular manner, directs generic protein effec- 
tors to their sites of action. 

The misleading historical perspective on the relationship
between RNAs and proteins was best expressed
by Ewa Grzybowska and colleagues, who concluded:
"The current perception of RNA-protein interactions
is strongly biased toward a protein-centric approach, in
which proteins regulate the expression and activity of
RNA, not the other way around".?6?


17 Plasticity


Gene-environment interactions occur in all organisms, 
ranging from bacterial transcriptional responses to nutri- 
ent availability to epigenetic changes in eukaryotes in 
response to environmental circumstances, such as that 
observed in the increased production of erythrocytes and
increased hemoglobin expression at high altitudes'? or 
the advent of type 2 diabetes upon prolonged obesity,* 
although genetic factors are also involved.25 

The major focus of studies of the molecular basis of
long-term gene-environment interactions in humans and 
other eukaryotes has been changes in the patterns of DNA 
methylation®* and, more recently, histone modifications.?? 
However, it is clear that RNA is also modified, over a far 
wider chemical range than DNA. If DNA methylation and 
histone modifications are RNA-directed (Chapter 16), as 
are some RNA modifications (Chapter 8), it is logical that 
RNA is the major conduit for epigenome-environment 
interactions! and, by extension, that the expansion of 
RNA modifications and RNA editing in complex organ- 
isms underpins phenotypic plasticity, learning, and cogni- 
tion.? It also appears increasingly likely that RNA is the 
vehicle for transgenerational soft-wired inheritance of 
experience.^.^ 


RNA MODIFICATIONS AND THE 
UNKNOWN EPITRANSCRIPTOME 


There are over 25 cell-specific versions of the 5' cap 
structure of RNAs and non-canonical initiating nucleo- 
tides with functional consequences in RNA processing, 
export, stability and translation.^-!5 There are also over 
140 known chemical modifications of internal ribonucleotides!”2
(Figure 17.1). These modifications occur
on all four standard bases as well as on the ribose, and 
were detected initially in the highly abundant rRNAs 
and tRNAs, and later in snoRNAs and snRNAs. For 
decades RNA modifications were thought to be irrevers- 
ible decorations important for structural stability and/or 
catalytic function, including in prebiotic evolution,” but 
this view changed with the discoveries that rRNA modifi- 
cations are context-specific,?* that modifications occur in 
thousands of mRNAs, lncRNAs, enhancer RNAs, vault
RNAs, miRNAs and other non-coding RNAs,” and,
especially, that RNA modifications are reversible,*8-40
leading to the birth of the term ‘epitranscriptome’ to
describe the collective of regulated RNA modifications.*!
There are technical challenges in identifying RNA
modifications in sequencing datasets.? Most of what is
presently known stems from the study of m6A, and to a
lesser extent m1A, m6Am (N6,2’-O-dimethyladenosine),
m5C and pseudouridylation modifications, for which
there are specific antibodies or reagents available? for
immunocapture to identify the positions of modified
bases.2>:26.28-33,43-46
m6A modifications in mRNAs occur typically near
the stop codon, but also in 5'UTRs, coding sequences,
introns and 3'UTRs.25222.7 m6A modifications have been
found to regulate miRNA processing?^^ and mRNA
polyadenylation, processing, splicing, stability, translation
and export^*-? by destabilizing RNA duplexes and altering
RNA-protein interactions.9-? m6A modifications of
mRNAs, promoter-associated RNAs, enhancer RNAs and
repeat RNAs have been shown to regulate chromatin state,
phase separation, the stability of R-loops (and their role in
the repair of double-strand breaks) and transcription.3-©
m6A modifications have been shown to regulate yeast
meiosis,” the activity of endogenous retroviruses,” het- 
erochromatin formation and TE function in embryonic 
stem cells and early embryos,274 mammalian stem cell 
renewal, differentiation and development,++525%75-78 sper- 
matogenesis, oogenesis and fertility,5%56577930 embryonic 
development,5!:% adipogenesis,*? circadian rhythms,4? 55:54 
stress responses,?-* hematopoietic differentiation,5* 
immune responses? and inflammation,???! IgH recombi- 
nation, neurogenesis,”*-* neural differentiation”” and 
neural circuitry,” spatiotemporal control of mRNA 
translation in neurons,!!0! cerebellar development, 9?.!03 
learning and memory,'?^-? cancer stem cell differen- 
tiation,!!° neuronal functions and sex determination in 
flies!!1-!? and even rice and potato yield,!!* in some cases 
involving interplay with histone modifications.?? 
Although most current techniques favor highly
expressed RNAs, m6A modifications are increasingly
being detected in non-coding regions of pre-mRNAs and
lncRNAs in tissues such as placenta, kidney, liver and
brain, with evidence of modulation of lncRNA functions


a In the case of m5C, RNA bisulfite sequencing.’

FIGURE 17.1 The set of known RNA modifications classified by their reference nucleotide, highlighting those that have been
associated with a human disease (red), as well as those for which a transcriptome-wide detection method has been established
(green). (Reproduced from Jonkhout et al.!”)


and properties, including splicing regulation and possible
effects of sequence polymorphisms.!? Specific examples
include m6A modification of MALAT1 as a structural
switch affecting its protein binding and phase separation
properties, 6/7 enhancement of Xist repressive
activity, 5 and modulation of the stress-induced repressor of
cyclin D1, pncRNA-D, to induce cell cycle arrest.!!?

Other RNA modifications that have been studied have 
been shown to similarly affect a wide range of processes 
including chromatin organization, mRNA stability, tRNA 


and miRNA processing, and to be involved in neurological 
and other disorders.3036,46,77,120-126 Many of these modifications
have been documented to affect the function of regulatory
RNAs, such as SRA,” 7SK?*? and vault RNAs?

As with histone modification enzymes, there is a
range of RNA modification writers, readers and erasers,
the loss or perturbation of which, including those acting
on rRNAs and tRNAs, results in a range of diseases,
including cancer, intellectual disability and developmental
disorders.!9.20.130-132

The repertoire, substrate range (often from tRNA to
mRNA) and deployment of RNA modification enzymes
have been expanded by successive gene duplications that
have occurred at the base of the eukaryote, metazoan,
vertebrate and primate lineages, with 90 cataloged in the
human genome.!*

In mammals there are multiple m6A writers (the
METTL protein family), readers and erasers!5%1% with
far-reaching functions. For example, METTL3 regulates
heterochromatin in embryonic stem cells? and
promotes homologous recombination-mediated repair of
double-strand breaks by modulating DNA-RNA hybrid
accumulation.” METTL16, which is essential for mouse
embryonic development, regulates the expression of
an enzyme that produces the methyl donor,
S-adenosylmethionine.!% The m6A reader Ythdc1 regulates the
scaffolding function of LINE1 RNA in mouse ESCs and early
embryos.” The m6A reader Prrc2a controls oligodendroglial
specification and myelination.” The ALKBH5 m6A
eraser controls translation“ and the splicing and stability
of long 3’UTR mRNAs in male germ cells.*°

There are eight mammalian m5C writers, Nsun1-7 and
Dnmt2. Nsun1, Nsun2, Nsun5 and Dnmt2 are present in all
eukaryotes, whereas the other Nsuns are specific to multicellular
organisms and are differentially expressed during
development, particularly in the brain: Nsun1-4 participate in
embryonic development, cell proliferation and differentiation;
disabling mutations in Nsun2? and Nsun7 cause
intellectual disability and male sterility, respectively; Nsun5 is
essential for normal growth and cerebellar development;
Nsun6 associates with the Golgi apparatus and catalyzes
the formation of m5C72 in specific tRNAs. It also
methylates mRNAs, particularly in their 3'UTRs, but is
apparently dispensable for normal development.36:4675,121,137-141
Nsun1 binds RNA polymerase II (RNAPII) and Nsun3
and Dnmt2 bind hnRNPK, which interacts with
lineage-specific transcription factors, and with CDK9/P-TEFb to
recruit RNAPII to active chromatin hubs.'?^ The subcellular
localizations of enzymes that methylate guanosine are
altered by neuronal stimulation? (Figure 17.2).

The field is in its infancy: most RNA sequencing
protocols involve conversion to DNA, with concomitant loss
of modification information, although mismatch patterns
and blockage of reverse transcription can provide an
indication.'? A solution is at hand with the advent of direct
RNA sequencing using nanopore technology, and the
(altered) signals from some base modifications are now
being identified.-49-5? One can only speculate on the
processes controlled by the many other RNA modifications
that are yet to be analyzed.


^ Loss of the Nsun2 ortholog in Drosophila causes short-term memory
deficits.

* Excitingly, nanopore sequencing has recently been adapted to read
protein sequences at single amino acid resolution. ^*.^5

In 2005, Katalin Karikó and colleagues discovered
that RNAs containing modified nucleotides [m5C, m6A,
m5U, s2U or N1-methylpseudouridine? (m1Ψ, a naturally
occurring component of eukaryotic 18S rRNA)] do
not activate the mammalian Toll-like receptor 7, an
innate immune receptor that detects single-stranded
RNAs,?? potentiating the development of mRNA-based
vaccines.^! Proteins produced from mRNA vaccines
stimulate both innate and adaptive immune responses,
and the platform is much more flexible and scalable than
attenuated viruses.!5% mRNA vaccines can be produced
within a day or two of a viral sequence being available
and formed the frontline against SARS-CoV-2, and will
likely be a platform for the delivery of other vaccines,
autoimmune rectification, targeted cancer therapies and
even the treatment of heart failure in the future, ^4-/58
using lipid nanoparticles for delivery.^? There is also
increasing appreciation of the potential for non-coding
RNA therapeutics, given the central role of RNA in most
biological processes, and growing evidence for the
efficacy of non-coding RNA interventions.!60-165


THE EXPANSION OF RNA EDITING 
IN COGNITIVE EVOLUTION 


An important subset of RNA modifications is base 
deamination, referred to as RNA “editing”.* There are two 
classes: adenosine deamination to inosine (A>I), which 
registers as guanosine in translation and RNA sequenc- 
ing, but has differences that may be important in vivo; and 
cytosine deamination to form uracil (C>U), or methylcy- 
tosine to thymine (meC>T). Alterations in both A>I and 
C>U editing feature prominently in human cancers.!67-'! 
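
Because inosine registers as guanosine in RNA sequencing, candidate A>I sites are commonly flagged as positions where the genome encodes A but a fraction of the cDNA reads show G. The following is a minimal Python sketch of that idea, not a published pipeline. The function name, the thresholds (min_depth, min_fraction) and the toy reads are illustrative assumptions, and the reads are assumed to be already aligned, gap-free, on the same strand as the reference and starting at reference position 0.

```python
from collections import Counter


def candidate_a_to_i_sites(reference, reads, min_depth=4, min_fraction=0.1):
    """Return (position, G fraction) for reference A positions read as G.

    Hypothetical helper: positions are 0-based, and every read is assumed
    to be aligned to the reference from position 0 without gaps.
    """
    sites = []
    for pos, ref_base in enumerate(reference):
        if ref_base != "A":
            continue  # only genomic adenosines can be A>I edited
        # Count the bases observed at this position, ignoring N or padding.
        counts = Counter(
            read[pos] for read in reads
            if pos < len(read) and read[pos] in "ACGT"
        )
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        g_fraction = counts["G"] / depth
        if g_fraction >= min_fraction:
            sites.append((pos, g_fraction))
    return sites


if __name__ == "__main__":
    reference = "CATAG"
    reads = ["CGTAG", "CGTAG", "CATAG", "CATAG"]
    print(candidate_a_to_i_sites(reference, reads))
```

In practice an A-to-G mismatch is only a candidate: genomic polymorphisms, sequencing errors and mapping artifacts produce the same signal and have to be excluded, for example by comparison with matched genomic DNA.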


A>I EDITING


A>I editing was discovered in the late 1980s by Brenda 
Bass, Hal Weintraub, David Kimelman and Marc 
Kirschner who showed that synthetic and natural RNA 
duplexes formed between sense and antisense RNAs 
expressed from the bFGF locus in Xenopus (toad) eggs, 
and double-stranded viral RNAs, are substrates for an 


d Replacement of uridine with m1Ψ in synthetic mRNA reduces its
immunogenicity and improves its translational capacity and stability.'>!

e The term RNA editing was coined by Rob Benne and colleagues
in 1986 to describe the small RNA-guided site-specific insertion or
deletion of uridines in mRNAs in mitochondria of trypanosomes
and related protists.!64-166

FIGURE 17.2 Dynamic changes in subcellular location of two enzymes (TRMT1 and TRMT1L) that catalyze
N2,N2-dimethylguanosine (m2,2G) RNA modification. (a) TRMT1 localizes to compartments associated with mitochondria and
TRMT1L to nucleoli in resting human neuroblastoma cells. (b) Depolarization of the cells causes the mitochondrial TRMT1 to
relocate to punctate domains in the nucleus, and the nucleoli to fragment (c). (Adapted from Jonkhout et al.!*)


enzyme that deaminates adenosines.?-^ A>I editing
has since been extensively characterized by Bass,
Kazuko Nishikura, Mary O'Connell, Robert Reenan,
Charles Samuel, Peter Seeburg, Marie Öhman, Gerhardt
Wagner and colleagues.

A>I editing is performed by animal-specific enzymes
called ADARs, which have evolved from enzymes that
deaminate adenosines in tRNAs.!08,175-180 Invertebrates
have one or two ADARs, whereas vertebrates have three
(ADAR1-3). The expression of the ADARs varies across
development and tissues in mammals. ADAR1 is widely
expressed throughout the body and is the most highly
expressed ADAR outside the central nervous system.!*!

Editing appears to occur both co- and
post-transcriptionally.'5?/5 The basic substrate is imperfect
double-stranded RNA, especially A:C mismatches, which
includes dsRNA regions in pre-mRNAs and lncRNAs,
often involving intron-exon base pairing.'*^ The substrate
specificities of the different ADARs are not well
understood,!75 although evidence suggests that ADAR1
imposes symmetrical editing at positions in dsRNAs
30-35 bp away from structural disruptions!% and that
ADAR2 substrate recognition involves a GCU(A/C)A
pentaloop conserved in mammals and birds.18
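
As a rough illustration of how the GCU(A/C)A consensus could be used as a first-pass filter, the Python sketch below lists every occurrence of the five-nucleotide sequence in a transcript. It is only a sequence scan under stated assumptions: the function name and the example sequence are hypothetical, and a match says nothing about whether the motif actually sits in the loop of a hairpin, which would require structure prediction or probing data.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
PENTALOOP = re.compile(r"(?=(GCU[AC]A))")


def find_pentaloop_motifs(rna):
    """Return (0-based start, matched sequence) for each GCU(A/C)A occurrence.

    Hypothetical helper: accepts RNA or DNA letters in either case and
    converts T to U before scanning.
    """
    rna = rna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in PENTALOOP.finditer(rna)]


if __name__ == "__main__":
    # Invented sequence, used only to exercise the scan.
    print(find_pentaloop_motifs("AAGCUAAGGCUCACC"))
```

Candidate positions from such a scan would still need to be checked against predicted or experimentally determined secondary structure before being treated as possible ADAR2 recognition elements.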

A>I RNA editing is widespread in RNAs encoding
proteins involved in neurotransmission, including pre- 
synaptic release machineries and voltage- and ligand- 
gated ion channels,!68176187158 where it alters codons 
or splicing patterns! and therefore protein structure- 
function relationships!72187191-197 to modulate the electro- 
physiological properties of the synapse and neuronal con- 
nections in response to activity,168191,192,19,198 and to adapt 
to environmental conditions.!?? 

The substrate range also includes RNAs encoding pro- 
teins involved in brain patterning, neural cell identity, mat- 
uration and function, as well as in DNA repair, implying a 
role for RNA editing not only in neural transmission and 
network plasticity but also in brain development and mem- 
ory consolidation.2% The RNAs encoding ADARs are 
themselves also edited,?%1-203 and RNA editing is regulated 
by m6A modifications,* indicating feedback loops and
interplay between RNA editing and modification systems. 

A>I RNA editing occurs commonly in introns, where it 
influences nuclear retention and splicing, including that of 
ADAR2 itself.!76!83.205-207 RNA editing also alters miRNA
processing, expression and target specificity.208-211 

The regulatory pathways that control A>I RNA editing 
are not understood. Presumably, editing alters the struc- 
ture and information content of coding and regulatory 
RNAs in response to environmental signals and experi- 
ence, a possibility supported by the observations that 
ADAR2 has inositol hexakisphosphate (InsP6) complexed
in its active site?? and that InsP6 regulates AMPA/gluta- 
mate receptors,” indicating that ADAR activity and/or 
target selection is linked to cell signaling pathways. 

Vertebrate ADAR1 and ADAR2 are widely expressed,
most highly in brain, where their editing profiles
overlap;52^ both are mainly localized in the nucleus,
although a longer isoform of ADAR1 shuttles between
the nucleus and the cytoplasm,'” where it modulates
the innate immune response (see below). ADAR1 and
ADAR2 form homo- or heterodimers in vivo and
dimerization is required for catalysis.!76-178

In addition to the deaminase domain and RNA-binding
domains? present in other ADARs, vertebrate
ADAR1 also contains one (in the constitutively expressed
shorter isoform) or two domains (in the inducible longer
isoform) that recognize an alternate left-handed helical
DNA or RNA structure, termed Z,216-22 which occurs
naturally throughout the genome." ADAR1 binds to Z-DNA
and Z-RNA in or near repetitive elements, especially Alu
elements.??


f ADAR2 also regulates the stability of mRNAs through editing of
Alu elements in their 3'UTRs!* with cross-talk from lncRNAs.!°

g There are three dsRNA-binding domains in ADAR1, and two each in
ADAR2 and ADAR3.215

The longer isoform of ADAR1 can be induced by
interferon! and edits non-coding regions of endogenous
dsRNAs at Alu repeats, which are ‘flipped’ into
Z-conformation to distinguish self and suppress
inappropriate activation of the innate immune response by
the MDA5 helicase, which occurs in the presence of
unmodified viral dsRNAs.22522 It is also induced during
learning and its knockdown leads to a reduction in Z-DNA at
sites where ADAR1 is recruited and an inability to
modify previously acquired memory.? Translation of the
longer isoform of ADAR1 is potentiated by m6A
modification of a conserved site in the ADAR1 transcript,?28
with cross-talk between modification systems during
host responses to viral infections. Loss of ADAR1 in
mice results in embryonic lethality, due to a failure of
hematopoiesis.?!2? Mutations in human ADAR1 are
one of the defined genetic causes of Aicardi-Goutières
syndrome, an autoinflammatory disorder characterized
by spontaneous interferon production and neurological
problems.2234 Curiously, however, the loss of the
editing capacity of ADAR1 has little developmental effect if
the innate immune system is prevented from sensing the
unedited dsRNAs.2!4255

Vertebrate ADAR2 is required for the editing of
neuroreceptor mRNAs, especially that encoding the AMPA
receptor subunit GluA2 (Gria2), which has the
functional consequence of rendering it Ca2+ impermeable.!?
ADAR2 activity is also regulated in part by snoRNAs
and nucleolar sequestration.2%7758 Mutations in human
ADAR2 cause microcephaly and neurodevelopmental
disorders.23%240 Its deficiency in mice causes seizures
and early lethality, which can be rescued by hard-wiring
the single nucleotide change in the Gria2 gene.?^' This
observation raises the question of why evolution has not
imposed this change in the first place, but rather
conserved the surrounding intronic sequences that are
required for editing.??


^ There are also other DNA structures that differ from the canonical
B-form right-handed double helix described by Watson and Crick,
such as G quadruplexes and A-form helices, which also occur in
double-stranded RNAs and in DNA-RNA hybrids. These were
discovered in the decade from 1979 to 1989,?! and presumably have
functional significance, but their distribution in genomes, and how
they might vary during differentiation and development, is as yet
only poorly characterized, a blind spot in the ENCODE projects.
Z-DNA is commonly found in introns, and RNA editing commonly
involves dsRNAs formed between exons and introns in pre-mRNAs,
adding to the selective pressure on their sequences?

The longer isoform may be constitutively expressed and even be the
dominant isoform in some tissues.22

While GluA2 receptors in adults are almost univer- 
sally edited,* such editing may be a mechanism for prun- 
ing synaptic trees during the postnatal maturation of 
neuronal circuitry in response to experience,?^^?^ so that 
only the relevant survive. ADAR2 expression increases 
as neurons mature and is spatially regulated within neu- 
rons.?525! Mice that lack ADAR2 but have the genomi- 
cally encoded compensatory codon change in GluA2 
exhibit changes in behavior, hearing ability and the 
expression profiles of RNAs in the brain, including those 
encoding proteins involved in synaptic trafficking.**? 
There is also an N-terminal extension of ADAR2 that 
is expressed most highly in the cerebellum (through the 
alternative splicing of an upstream exon), which harbors 
a sequence motif closely related to the single-stranded 
RNA-binding domain of ADAR3, which is also expressed 
most highly in the cerebellum.?? Another splice variant 
of ADAR2 has an insertion of an Alu cassette within the 
deaminase domain, which alters its catalytic activity.2% 

The single ADAR in Drosophila is homologous to 
vertebrate ADAR2 but also has similarities to ADAR1,
in that it mediates suppression of both innate immune 
responses and brain functions.?%%256 Each neuronal popu- 
lation in Drosophila has a different editing signature'®* and 
Drosophila lacking ADAR are morphologically normal but
exhibit extreme behavioral deficits including temperature- 
sensitive paralysis, locomotor defects, tremors and neuro- 
degeneration.250256257 Bees exhibit widespread A>I editing 
during foraging and brood caring task performance.?** C. 
elegans has two ADARs, most similar to mammalian 
ADAR2 and ADAR3, which edit germline and neuronal 
transcripts (including 3'UTRs), the loss of which results in 
chemotaxis defects and reduced lifespan.259261 

ADAR3 is vertebrate- and brain-specific, contains 
both single- and double-stranded RNA-binding domains, 
and is thought to be catalytically inactive," although its 
deaminase domain appears normal, possibly because it 
does not dimerize and may act as an inhibitor of ADAR1
and ADAR2.2%226% Loss of ADAR3 in mice causes no 


k GluA2 receptors are under-edited in malignant brain tumors.?4

l Evidence from knockout of the Drosophila ortholog of ADAR2 also
suggests that "edited isoforms of CNS proteins are required for opti- 
mum synaptic response capabilities in the brain during the behavior- 
ally complex adult life stage”.25 

m There is another ADAR-like gene in mammals, TENR, which is
expressed in the male germline and lacks a key catalytic residue in 
the deaminase domain.!65.75.777-179

obvious developmental deficiencies, but does affect
learning and memory.265 

Interestingly, cephalopods such as squid, cuttlefish and 
octopus, highly intelligent invertebrates, use A>I RNA 
editing far more extensively than mammals to modulate 
the sequence of mRNAs specifying proteins involved in 
nerve-cell development and signal transmission.?5!266-269 

Editing of non-coding sequences, on the other hand, 
has expanded enormously in primates, especially 
humans.?? While for many years it was thought that 
RNA editing is primarily directed at altering protein 
sequences, analysis of large-scale cDNA sequencing 
datasets revealed that it occurs in thousands of tran- 
scripts, largely in non-coding sequences,?”” altering 
RNA structure,? and presumably regulatory circuits 
and networks, thereby influencing RNA-directed epigen- 
etic memory. These analyses also showed that there is a 
massive expansion (a 35-fold increase) of A>I editing of 
human RNAs compared to mouse, mainly in the brain 
and mostly in Alu sequences.270-274 

Alu elements invaded the genome in three waves during 
primate evolution and occupy 10.5% of the human genome 
(~1.2 million largely sequence-unique copies).?77278 The 
editing of Alu sequences is higher in human transcripts 
compared to nonhuman primates, and new editable 
human-specific Alu insertions, subsequent to the human- 
chimpanzee split, are enriched in genes related to neuro- 
nal functions and neurological diseases?" (Figure 17.3). 
Virtually all adenosines within double-stranded regions of 
Alu transcripts undergo A-to-I editing, although most sites 
exhibit editing at only low levels (<1%), and it has been 
estimated that there are over 100 million Alu RNA editing 
sites distributed across most human genes.?”” 
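
As a rough illustration of how per-site editing levels such as the "<1%" figure above are usually expressed: inosine is read as guanosine when cDNA is sequenced, so the editing level at a genomic adenosine can be estimated as the fraction of covering reads that carry G. The Python sketch below assumes that convention. The function name, site labels and read counts are hypothetical and are not taken from the studies cited here.

```python
def editing_level(a_reads: int, g_reads: int) -> float:
    """Fraction of reads showing G (edited) at a genomic A position."""
    total = a_reads + g_reads
    if total == 0:
        raise ValueError("no coverage at this site")
    return g_reads / total


# Hypothetical (A reads, G reads) counts for three adenosines in an Alu duplex.
sites = {
    "site_1": (990, 10),
    "site_2": (9995, 5),
    "site_3": (40, 60),
}

levels = {name: editing_level(a, g) for name, (a, g) in sites.items()}

# Sites edited at "low levels", using the <1% threshold mentioned in the text.
low_level_sites = sorted(name for name, lvl in levels.items() if lvl < 0.01)
print(low_level_sites)
```

In practice such estimates also need adequate read depth and filtering of genomic A/G polymorphisms, which this sketch does not attempt.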

Alu elements (and the related B2 SINE elements in 
mice) have been linked to a wide variety of functions, 
including new exons, splice junctions, promoters, nuclear 
localization, differentiation signals and stress responses, 
both within longer transcripts and as separate small 
RNAs?80288 (Chapters 10, 13 and 16). They also have a
self-cleaving property? and their increased processing
has been linked to neurological disorders.?552?? However, 
their intense editing suggests a wider role as modular 
cassettes that permit the superimposition of plasticity on 
an otherwise hard-wired RNA regulatory system, which 
has been positively selected for physiological adaptation 
and cognitive advance. Alu elements are derived from 
a dimeric fusion of 7SL RNA, which is part of the sig- 
nal recognition particle involved in targeting proteins 
for export (Chapter 8), and the conservation of its core 
structure?77212% may provide a clue as to its function and 
success.

FIGURE 17.3 Higher A>I editing levels in human versus nonhuman primates. (a) Editing levels of 75 sites in six transcripts
originating from cerebellum tissue of four humans, two chimpanzees, and two rhesus monkeys. (b) Editing level per site for 
humans, chimpanzees and rhesus monkeys. (c) Number of common human and chimpanzee genes showing new (independent) 
Alu element insertions, 115 of which (out of 165) occur in genes with neurological function and/or associated with neurological 
disease. (Reproduced from Paz-Yaacov et al." with permission of the authors.) 




[Figure 17.4 image; panel labels: Fish; Chicken, Lizard; Ancestral Placental Mammal; Mouse; Human, Chimpanzee, Rhesus Macaque]


FIGURE 17.4 The expansion of APOBEC genes in vertebrate, mammalian and primate lineages (colors denoting subfamilies).
(Reproduced from Harris and Dudley?" with permission of Elsevier.) 


C>U EDITING 


C>U editing is performed by a family of proteins called
APOBECs,2%-2% so named because the first to be discovered
(‘ApoB editing complex 1’)," edits the mRNA encoding
Apolipoprotein B, a key constituent of circulating lipid
transport vesicles (produced in the liver), introducing
a stop codon to produce a shorter version in the
intestine.”

APOBECs arose at the beginning of the vertebrate
radiation, although there is evidence of precursors in 
invertebrates,’ and can edit both RNA and single- 
stranded DNA (C>T), which muddies the functional 
waters. APOBECs can also edit meC (but not the TET-oxidized
bases 5-hydroxymethylcytosine, 5-formylcytosine
and 5-carboxylcytosine) to form T,” an editing
event that cannot be distinguished in bisulfite-based
methylation assays, and there may be cross-talk between 
methylation and editing systems in vivo.?? 

There are five families of APOBECs: AID and APOBEC2 
occur in all vertebrates, APOBEC4 and APOBEC5 in tetrapods,
APOBEC1* in the amniotes (birds and mammals)


n In 1987, by Lawrence Chan and colleagues, and by Lyn Powell, 
James Scott and colleagues, who reported that apolipoprotein B 
(apoB) mRNA contained a tissue-specific C>U base-modification 
that is not genomically encoded.??729 

o APOBEC1 deficiency has transgenerational epigenetic effects on
testicular germ cell tumor susceptibility and embryonic viability.??? 


and reptiles, and APOBEC3 (some members of which have 
two deaminase domains) in placental mammals.293-296.304.305 
Expansions of the latter in different lineages correlate with
the extent of germline colonization by retroviruses, most 
notably in primates, which have seven APOBEC3 paralogs 
(A, B, C, D/E, F, G, H)?95296506 (Figure 17.4), which exhibit 
some of the strongest signatures of positive selection in the 
human genome.0735 Other mammals show intermediate 
extents of APOBEC3 duplication.306309.310 The APOBECs 
show tissue-specific expression, in the immune system, 
muscle, liver and brain,?9?!! with APOBEC3G expression 
primarily in neurons?? and APOBEC4 in testis?

The ancestral gene, AID, arose in the jawless fishes as 
a central feature of adaptive immunity, and is involved in 
the somatic rearrangement and hypermutation of immuno- 
globulin domains in B cells and T cells, processes that are 
also heavily regulated by lncRNAs (Chapters 13 and 16).

The other APOBECs are generally thought to provide
a defense against exogenous retroviruses and the
mobilization of retrotransposons, such as endogenous 
retroviruses and SINE and LINE retroelements,?06513315 
although why the APOBEC3 family expanded under 
strong selection in primates is unclear. Perhaps the clue 
is the widespread cooption of transposable elements in 
neuronal differentiation and function.?!é 

Recent studies have shown that L1 retroelements 
are mobilized in neurons in culture to induce somatic 




mosaicism, a process that is controlled in part by methyl- 
CpG-binding protein 2 (MeCP2) (Chapter 14).3!73!° L1 
elements are also differentially and dynamically meth- 
ylated and histone modified during stem cell repro- 
gramming and neurodifferentiation.*2032 L1 and Alu 
retroelements are mobilized in the human brain,*2 which 
suggests the possibility that the APOBECs, especially the
APOBEC3 family, like KRAB zinc finger proteins,?*2 
might have evolved to domesticate retroelements, and 
manage their activity? in response to environmental cues, 
not simply suppress them. 

Moreover, DNA demethylation in human neural pro- 
genitor cells leads to transcriptional activation and chro- 
matin remodeling of hominoid-specific L1 elements,4 
while older L1s and other classes of transposable elements 
remain silent; these activated L1s act as alternative promoters
for many protein-coding genes involved in neuronal
functions, “revealing a hominoid-specific L1-based
transcriptional network controlled by DNA methylation 
that influences neuronal protein-coding genes”.5% 


THE BRAIN 


Despite a century since Santiago Ramón y Cajal's metic- 
ulous depictions of the architecture and complexity of the 
central nervous system,**° and many subsequent devel- 
opments in neurophysiology, we are still nowhere near 
understanding the molecular basis of high-level brain 
function. While the general architecture of the brain 
is hard-wired,?! its fine connections are selected and 
evolve in response to experience, as proposed by Gerry 
Edelman.?59 

The brain is plastic? and has the most complex molecular
transactions. Except for the testis (itself another world), 
the coding and non-coding transcriptome is most var- 
ied and complex in the brain, as is the extent of RNA 
splicing, trafficking, modification and editing. The cell-type
specificity of lncRNA expression is also most pronounced
in brain (Chapter 13) and the genomic sequence
variants affecting neuropsychiatric functions, neurode- 
generative diseases and some neurodevelopmental disor- 
ders? primarily lie in non-coding regions (Chapter 11). 
Many neurodegenerative diseases appear to be linked to 
dysregulation of lnc/enhancer RNAs and/or be a consequence
of aberrant RNA-protein interactions and formation
of inclusion bodies associated with expansions


p While avoiding neurodevelopmental or neuroinflammatory disorders
due to inappropriate L1 expression of neurotoxic retroviral
sequences and proteins.??4?7 

q L1 expression is differentially regulated in pluripotent stem cells of
humans and other great apes.?*

of simple repeat sequences.?9?? As with non-coding
genomic regions that show accelerated evolution or are 
specific to primates or humans?^??!! (Chapters 10 and 13), 
single cell studies confirm the highly regulated expres- 
sion of non-coding RNAs in specific brain areas relevant 
to human evolution and neurological diseases.??-?^ 

There is considerable evidence for the involvement 
of regulatory RNAs in brain evolution, development and 
function “by virtue of their abundant sequence innova- 
tion in mammals and plausible mechanistic connections 
to the adaptive processes that occurred recently in the 
primate and human lineages",499^' including the widespread
use of lncRNAs to nucleate specialized domains
in the neuronal nucleus;^* the expression of primate- 
restricted KRAB zinc finger proteins in specific regions 
of the developing and adult brain, which target cognate 
TE regulatory elements to control neuronal differentia- 
tion;?25326549.50 the widespread use of 3'UTRs as regula- 
tory RNAs?!?? to, e.g., maintain axonal integrity; and 
the massive expansion of RNA editing during vertebrate 
evolution, especially in primates. 

Small RNAs, lncRNAs, TEs and networks thereof are
involved in neuronal differentiation, synaptic plasticity,"
long-term potentiation, learning and behavior,?46354366
examples being the role of Malat1 in synapse formation;?9"
the response of Gomafu to neuronal activity and its modulation
of schizophrenia-associated alternative splicing
and response to methamphetamine; the response of
Neat1 to neuronal activity and its role in mediating histone
methylation, age-related memory impairment and behavioral
responses to stress;???7 the role of the lncRNA
Gas5 in cocaine action and addiction;??? the role of the
TE-derived neuronal lncRNA BC1: in anxiety, exploratory
behavior?" and memory;?? the regulation of impulsive
and aggressive behaviors by lncRNA MAALIN;??
the downregulation of the primate-specific lncRNA
LINC00473 in the prefrontal cortex of depressed females
but not males, accompanied by female-specific changes
in synaptic function;*8% the blockage, upon loss of function
of lncRNA Meg3, of the glycine-induced increase of the
GluA1 subunit of AMPA receptors on the plasma membrane,
a major hallmark of LTP;*! the role of lncRNAs
Tsx in hippocampal short-term memory formation“? and
LoNA in long-term memory formation;** the regulation
of social hierarchy in mice by lncRNA AtLAS;%% the
regulation of locust aggregation by lncRNA PAHAL;35


r Which “may drive species-specific changes in cognition"?

s BC1 regulates dopamine receptors?” and associates with the fragile
X syndrome protein FMRP to regulate the translation of specific
mRNAs at synapses.?7%

FIGURE 17.5 Rag1 expression in mouse brain. In situ hybridization of parasagittal sections of (a) 10 day postnatal and (b) adult
brain hybridized with RAG-1 antisense riboprobes, counterstained with cresyl violet to reveal the location of cell bodies. The
strongest antisense hybridization was observed in the cerebellum and hippocampus. (Reproduced from Chun et al.*% with permission of Elsevier.)


and lncRNA dynamics in the behavioral transition from
nurses to foragers in honeybees.38 Moreover, epigenetic
processes (which are likely RNA-directed, Chapter 16) 
are required for synaptic processes, cognition, learn- 
ing and memory — if epigenetic processes are disrupted, 
learning is also disrupted.^10.15.387-390 

Consequently, it appears that the brain adapted pro- 
cesses and regulatory mechanisms that are relatively 
hard-wired in development, rendering them soft-wired to 
enable the formation of synaptic networks that are tuned 
by environmental cues and cell communication to pro- 
cess, store and recall information. 

There are several other intriguing aspects of brain 
molecular biology. 

First, most if not all components of the innate and adap- 
tive immune systems have paralogs and orthologs expressed 
in the brain, most of which also occur in invertebrates 
(before the appearance of the adaptive immune system in 
vertebrates), suggesting that the adaptive immune system is 
a specialized offshoot of cell recognition pathways that first 
evolved for communication in the nervous system.' 

For example, the immunoglobulin (Ig) fold is pres- 
ent in most neuronal adhesion and receptor molecules, 
including the N-CAMs, myelin-associated glycoproteins, 
nectins, telencephalin, contactin and neuroglian, as well 
as in other proteins that are found in the immune system 
but also occur in brain, including the Toll-like receptors, 
CD4, Thy-1, the major histocompatibility complex and 
the complement family.??»-40! 

Toll-like receptors, thought mainly to activate innate 
immune responses, also regulate development, neural 
morphogenesis and neural connectivity, ^40 in part 


t The blood brain barrier may be a mechanism to prevent the two sys- 
tems from interfering with each other,??'5?? as they may in pathologi- 
cal situations.395394 


by recognizing neurotrophins to control neuronal sur- 
vival and death and acting as adhesion molecules to 
instruct axon and dendrite targeting and synaptic partner 
matching.404 

Cytokines, some of which are expressed in inverte- 
brate glial cells,*% are also present in the brain, where 
they affect neurotransmitter production, leading to 
changes in motor activity, anxiety, arousal and alarm. 
They also regulate sleep and a variety of neuroendocrine 
functions, as well as neuronal development.+06407 

The RAG1 and RAG2 proteins, which have their origins
as transposases and mediate V(D)J recombination
in B-cell and T-cell receptors, are also expressed in the
nervous system^09-4!! (Figure 17.5). V(D)J recombination,
like programmed genomic rearrangements and
DNA repair in other organisms, is RNA directed.+12-414 
Lack of RAG-1 impairs memory formation^? and lack of 
RAG-2" impairs retinal development, axonal growth and 
navigation.^!! Somatic gene recombination also occurs in 
human neurons’ with similarities to V(D)J recombination 
but with a different mechanism that involves an RNA 
intermediate and reverse transcription.^! 

Second, while still largely a black box, neurons regu- 
late activity differentially at thousands of synapses, which 
involves transport along microtubules in both neurons 
and associated oligodendrocytes” of RNA granules??? 
containing mRNAs and non-coding RNAs (including 
miRNAs, antisense pseudogene transcripts, and Alu- and


u RAG2 activity is regulated by a PHD domain that binds histone
H3K4me3 modifications, indicating an interplay between epigenetic
information and DNA recombination.*'®

v Widespread mosaic somatic gene recombination in neurons was
first detected in the Alzheimer's disease-related gene APP, which 
encodes amyloid precursor protein.*” 

w There is also evidence of RNA transport between glial cells and
neurons.^19420

other TE-containing RNAs) via various RNA-binding
proteins and motor proteins.354,364,423-427 There is synapse- 
specific local translation and, in all likelihood, context- 
dependent processing, editing and modifications of RNAs 
in response to activity.?5*48-430 A number of lncRNAs
have been shown to be regulated by neuronal activity and 
to regulate synaptic protein localization and translation, 
as well as synapse density, morphology, dendritic tree 
complexity, activity, plasticity and stability.376,378,354,431-434 
The neuronally expressed gene Arc, which is essential for 
synaptic plasticity and memory formation, 5-47 encodes 
a repurposed retrotransposon-derived protein that medi- 
ates intercellular RNA transfer.*98 

Synaptic protein synthesis associated with memory 
formation is also regulated by the RNA interference 
pathway,**? including by Mili-bound (26—28nt) piRNAs, 
thought to be mainly involved in repressing TEs in the 
formation of germ cells but also highly expressed in the 
brain.356440-442 The piRNA pathway is also required for 
adult neurogenesis in mice*% and is involved in the regu- 
lation of transposon mobilization in Drosophila brain.^* 
The loss of Mili results in hypomethylation of LINE1
promoters and behavioral deficits such as hyperactivity 
and reduced anxiety.^^ 

Third, neuronal RNA and protein transport between the cell
body and synapses is bidirectional,“ which has been 
interpreted as a "sushi-train" mechanism to patrol syn- 
apses,*46447 but retrograde transport back to the nucleus 
may also return information, as a mechanism for con- 
solidating long-term memory??? (see below). Loss of the 
RNA-binding and transport proteins Staufen and Pumilio 
leads to the inability to form long-term memory.**® 
Staufen also binds Alu sequences.^^? 

Fourth, there is widespread transcription at neuro- 
nal activity-regulated enhancers.?? Neuronal enhancers 
are hotspots for DNA single-strand break repair?! and 
such hotspots occur at sites involved in neuronal iden- 
tity, synapse function and neural cell adhesion, which are 
enriched in RNA-binding proteins.5?^^? Brain activity 
and fear conditioning cause DNA double-strand breaks
in neurons?^^5 and activity-induced DNA breaks govern
the expression of neuronal genes.^? A lncRNA is
required for DNA damage response in neurons, the loss 
of which causes Purkinje cell degeneration and impairs 
motor function.*” The DNA repair-associated protein 
Gadd45γ is required for the consolidation of associative
fear memory.’ DNA repair is focused on transcribed 
genes and declines with age — possibly associated with 
reduced learning activity — with deficiencies in DNA 
repair linked to both developmental and age-associated 
neurodegenerative diseases.*?

Moreover, there are many unusual “DNA repair”
enzymes and DNA polymerases in the brain,* some with 
reverse transcriptase activity." While these phenomena 
are usually interpreted in terms of protecting the genomic 
integrity of post-mitotic neurons, an alternative (and not 
mutually exclusive) possibility is that RNA-directed
changes to DNA (“re-writing to disc”) are involved in long-term
memory formation.??? The human DNA polymerase
Polθ, which occurs in the brain, was recently shown to
reverse transcribe RNA and promote RNA-templated
DNA repair.*! Polθ also promotes repeat expansions
in Huntington's Disease and other neurodegenerative 
disorders.^9? 


RNA-DIRECTED TRANSGENERATIONAL 
EPIGENETIC INHERITANCE 


The foundational work on RNA interference in C. elegans 
showed that small RNA-mediated gene silencing can be 
inherited for many generations (Chapter 12),!4463-465 most 
stably when provoked by maternal piRNAs.*% piRNAs 
are required to initiate inheritance but not to maintain it, 
although maintenance is dependent on the nuclear RNAi 
pathway.*°°-4© Similar transgenerational inheritance 
is observed in plants, where the inheritance is confined to
dsRNAs targeting methylation of gene promoters rather 
than the transcribed sequence.* The existence of many 
imprinted alleles shows that epigenetic information can 
also be transmitted through meiosis in mammals to con- 
trol gene expression in the next generation. 

Although their evolutionary significance has not been 
widely discussed, the observations in C. elegans refute 
the long-standing assumption that the soma cannot com- 
municate with the germline, since the inherited dsRNA 
triggers can be delivered by injection into somatic cells 
or ingestion of engineered bacteria producing dsRNAs, 
opening a new frontier for understanding the nature of 
both hard- and soft-wired inheritance. 

Unusual non-Mendelian patterns of inheritance were, 
in fact, first observed in peas by Bateson in 1915,*! and 
occasionally thereafter in other plants, but were not stud- 
ied in a systematic way until Robert Brink, who coined 
the term ‘paramutation’ in the 1950s to describe the atyp- 
ical inheritance of traits displayed by particular alleles in 
maize (where it is best studied),^?-^^ tomato and other 
species*”>-4”7 (Chapter 5). 


x These enzymes include DNA polymerase C, which performs RNA-directed
DNA modification.*% Some are themselves subject to RNA
editing, suggesting a molecular mechanism for context-dependent 
strength of memory formation.?% 


[Figure 17.6 image; left panel: b1 paramutation in maize (MOP1, required for transcriptional silencing); right panel: Kit paramutation in mice]


FIGURE 17.6 Paramutation at the b1 locus in maize (left; arrow at top left indicates the position of seven tandem repeats of a
853 bp sequence unique to this location within the maize genome) and (right) at the Kit locus in mouse (paramutable and paramutagenic
Kit^tm1Alf/+ alleles), which confers white tail tips. (Reproduced from Chandler^? with permission of Elsevier.)


Paramutation is RNA-directed transgenerational 
inheritance."? It may be summarized as the transfer of 
epigenetic information from one allele of a gene to another 
to induce metastable silencing, which is heritable for 
generations but incompletely penetrant and reversible." 
Importantly, paramutation is a somatic phenomenon, 
which is transmitted to the germline."^ Mechanistically, 
paramutation is transacted by small RNAs that direct 
RNA-processing and/or chromatin-modifying proteins to 
RNA transcripts or DNA in a sequence-specific manner, 
with self-reinforcing feedback loops that involve differen- 
tial methylation and the biogenesis of particular classes of 
sRNAs, including piRNAs, mediated by Argonaute proteins.^747480
Moreover, mutations in Mop1 (‘Mediator
of paramutation1’), a component of the RNA-directed
DNA methylation pathway responsible for methylation of 
TEs adjacent to transcriptionally active genes, alter the 
distribution and frequency of meiotic recombination in 
maize,**! indicating a link between transposons and plas- 
ticity of inheritance. 

While initially thought to be confined to plants, para- 
mutation also occurs in animals+74848 (Figure 17.6). 
In C. elegans, where animal RNA interference was 


discovered, the phenotypic consequences of ectopically 
introduced small dsRNAs can persist for many genera- 
tions, as with plant paramutation, without any change 
to the underlying sequence.*%%:48* Paramutation has 
also been described in Drosophila and mouse, the lat- 
ter affecting a wide spectrum of characteristics such as 
pigmentation, cardiac hypertrophy, embryo development 
and axonal growth (the latter via a lncRNA), again mediated
through the RNA interference pathway and small
interfering RNAs.474824835485-457 The inheritance of para- 
mutations, at least in mouse, requires the RNA methyl- 
transferase Dnmt2.*88 

Paramutation is associated with, and its strength 
is often dependent upon the length of, simple (usually 
dinucleotide or trinucleotide) tandem sequence repeats 
(STRs) in the locus," an interesting observation in view 
of the vast number of STR sequences that occur in ani- 
mal and plant genomes. 
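
Because STR effects are described in terms of the number of tandem copies of a short unit, it may help to see how such tracts can be located and counted in a sequence. The following Python sketch is only an illustration: the function name, the minimum of four repeat units and the toy sequence are assumptions made here, and real STR genotyping tools handle imperfect repeats and sequencing errors that this regular expression ignores.

```python
import re


def find_strs(seq: str, min_units: int = 4):
    """Return (start, unit, copies) for perfect di-/trinucleotide tandem repeats."""
    # A 2-3 base unit followed by at least (min_units - 1) further exact copies.
    pattern = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs (e.g. AAAAAAAA), which are not di/tri repeats.
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence containing a (CA)5 dinucleotide tract starting at position 3.
print(find_strs("TTGCACACACACAGGT"))  # [(3, 'CA', 5)]
```

Comparing the `copies` value for the same locus between individuals, tissues or generations is the kind of length variation the text refers to.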


y STRs are also referred to as ‘variable number tandem repeats’
(VNTRs) or ‘microsatellites’, variation in which was exploited for
DNA fingerprinting by Alec Jeffreys in the late 1980s.%

The human genome contains over one million highly
polymorphic and plastic STRs, whose mitotic (somatic) 
and meiotic (germline) expansion/contraction rates can be 
orders of magnitude higher than single nucleotide muta- 
tions, and impart a continuum of effects.^0-49 Variation 
in STRs is associated with neurological and psychiatric 
disorders? such as autism, and cancer, as well as the modu- 
lation of physiological and neurological traits, including 
circadian rhythms, sociosexual interactions, intelligence, 
hormone sensitivity, cognition, personality, addiction, neu- 
ronal differentiation, brain development and behavioral 
evolution.+?2-499 An STR has also been shown to regulate 
the transcription of the hTERT (human telomerase reverse 
transcriptase) gene in a cell-context-dependent manner, 
the absence of which results in telomere shortening, cel- 
lular senescence and impaired tumor growth.5% 

STRs are enriched in promoters and enhancers, asso- 
ciated with differential DNA methylation, and account 
for 10%-15% of the genetic variation observed in com- 
plex traits, making them substantial contributors to the 
missing heritability in genome-wide association studies 
that only poll haplotype blocks.49349°50!5 This is exem- 
plified by the paramutation-like behavior imposed by 
untransmitted paternal alleles of the human insulin gene 
that have expanded numbers of repeats associated with 
a reduced incidence of type 1 diabetes.5% Interestingly, 
a positively selected brain-specific lncRNA contains a
tandem repeat that varies among individuals and affects 
its stability.5% A large and highly polymorphic human- 
specific 30bp tandem repeat located within an intron of a 
gene encoding a calcium channel has been shown to act 
as an enhancer and to be associated with bipolar disorder 
and schizophrenia. 

The mutation rate in STRs is controlled, in part, by 
epigenetic processes, and the length of STRs, including 
in humans, can be modulated by environmental param- 
eters, notably (as the main one studied) stress.476477,506;507 
As summarized by Jay Hollick: “The ability of heritable 
epigenetic regulatory information on one homologue to 
be copied to the other represents a potential adaptation of 
the diploid condition to rapidly disseminate a memory of 
environmental responses to future generations". ^77 

The extent of STR modulation of gene expression may 
be underestimated and underappreciated. It is usually 
only reported in model organisms where structured pedi- 
grees and genotypes can be constructed and monitored. 
A genome-wide analysis in tomato identified “thousands 
of candidate regions for paramutation-like behaviour... 


z Reasonably proposed to be 'the pathological ends of phenotypic bell
curves in which healthy individuals occupy the middle territory'.^??

the methylation patterns for a subset of (which) segregate
with non-Mendelian ratios, consistent with secondary
paramutation-like interactions to variable extents
depending on the locus”.5% It would be surprising if this 
was not a general phenomenon, with enormous implica- 
tions for understanding plant and animal biology and 
evolvability (Chapter 18). 

While most studies of paramutation have involved 
focused analyses of physical traits, there have been a 
number of reports that transgenerational epigenetic 
inheritance (in animals as diverse as worms, planarians 
and mammals) extends to metabolic, endocrine, immu- 
nological and cognitive experience, including trauma and 
stress, as well as learned behaviors.!?5095!8 There is both 
male and female transmission.'*912322 Various studies 
have implicated lncRNAs, miRNAs, CTCF recruitment
to an Fto* enhancer, peroxisome proliferator-activated 
receptor (PPAR) pathways, cysteine synthases and 
tRNA fragments, and modifications thereof, as being 
involved.478.516.519,525-534 

One of these RNAs, the vault RNA VTRNA2-1, was 
identified in genome-wide screens “as a top
environmentally responsive epiallele”,5% which has been 
associated with effects on oocytes of preconceptual alco- 
hol consumption?? and elsewhere with cancer etiology 
and outcomes.5%%5% The progeny of mice that survived a 
sublethal systemic infection with Candida albicans or an 
endotoxin dose exhibited cellular, developmental, tran- 
scriptional and epigenetic changes in the myeloid pro- 
genitor cell compartment, with enhanced responsiveness 
to endotoxin challenge and improved protection against 
systemic heterologous bacterial infections.* The sperm 
DNA of parental male mice infected with C. albicans 
showed DNA methylation differences linked to immune 
gene loci. piRNAs derived from ancient processed 
viral pseudogenes transmit transgenerational sequence- 
specific immune memory in rodents and primates.5% 
There are also innate fears and phobias that clearly have 
appreciable genetic components.?4^054! 

It has been shown that specific tRNA fragments 
(Chapter 12) in mice are transferred from the epididy- 
mis to maturing sperm through small vesicles called 
epididymosomes. These tRNA fragments are then con- 
veyed by the sperm to the fertilized egg where they influ- 
ence the expression of specific genes associated with the 
retroelement MERV during embryonic development. In 
mouse, changes to the paternal diet modulate the tRNA 
fragment composition of epididymosomes, with a consequent 
alteration of specific metabolic pathways in the 
offspring.??6527 


* Fto encodes an N6-methyladenosine demethylase that is associated 
with obesity??? and is required for adipogenesis.?? It also regulates 
neurogenesis, neural circuitry, memory formation and locomotor 
responses.?+98.105,524 Recruitment of CTCF to an Fto enhancer is 
involved in transgenerational inheritance of obesity.?!6 

The mechanisms by which experience is conveyed 
intergenerationally are ill-defined, and the field will 
remain controversial until this is understood. Not only is 
it difficult to separate genetic and epigenetic effects from 
cultural and environmental influences, 47954254 it is also 
difficult to conceive a mechanism to transmit complex 
traits, even considering the possibility of RNA signaling 
between the soma and the germline. Relevant perhaps is 
that, like the brain, the testis is an immunologically privi- 
leged tissue, with a tight barrier between blood vessels 
and the Sertoli cells in the seminiferous tubule, which 
isolates the later stages of sperm development.54^6547 

Whatever the details, it is nonetheless already clear 
that there are two forms of inheritance — gene alleles and 
epialleles — hard-wired DNA sequence information and 
RNA-directed epigenetic information that is responsive 
to and influenced by environmental factors, directly 
contradicting the fixed inheritance of variation assumed by 
the Modern Synthesis. 
18 Beyond the Jungle of Dogmas 


“In science, ‘perverse’ cracking of pots is some- 
times necessary to design new experiments, and 
to imagine new ideas that could help one to forge 
ahead, beyond the actual jungle of dogmas and 
preconceived opinions.” 


(Klaus Scherrer¹) 


THE MISUNDERSTANDING OF 
MOLECULAR BIOLOGY 


It seems that the nature of genetic information in com- 
plex organisms has been misunderstood since the incep- 
tion of molecular biology, because of the assumption 
that most genetic information is transacted by proteins. 
This assumption holds largely true for prokaryotes and 
to a lesser extent for eukaryotic microorganisms, which 
mainly must organize a cell to obtain nutrients and repro- 
duce, albeit itself no mean feat. However, developmen- 
tally complex organisms, especially motile animals, have 
had to evolve much more sophisticated mechanisms to 
orchestrate cell differentiation and their assembly into 
highly organized ensembles.?? 

The foundational assumption that genes (only) encode 
proteins led to many subsidiary assumptions, primarily 
that the vast tracts of non-coding and repetitive sequences 
in the genomes of complex organisms are graveyards of 
evolutionary junk colonized by molecular parasites. This 
rationalization was not disturbed by contrary genetic and 
molecular evidence assembled by Barbara McClintock, 
Roy Britten and Eric Davidson, Ed Lewis and others whose 
intuitions were ignored. It persisted in handwaving about 
the power of combinatorial control of gene expression by 
transcription factors. It persisted in founder fallacies and 
validation creep, a notable one being that developmental 
‘enhancers’ bring transcription factors bound at their pro- 
moters into contact with the promoters of protein-coding 
genes whose expression they control, rather than the now 
increasingly clear alternative that they produce regulatory 
RNAs that organize local transcription and splicing hubs. 
Indeed, understanding enhancers is key to understanding 
the programming of differentiation and development. 

Contrary to the view that the genomes of humans and 
other complex organisms are full of non-functional evolu- 
tionary detritus, they are in fact replete with information 
and dynamic activity. The human genome makes trillions 
of cell fate decisions during development with 
high specificity and near-perfect reproducibility. This 
precision is effected by epigenetic mechanisms directed 
by regulatory RNAs, whose versatility may be largely 
achieved by programmed (cell state-specific) alternative 
splicing to alter target sites and modular recruitment of 
different types of effector proteins, incorporated into 
decisional hierarchies networked by sequences coopted 
from and distributed by transposable elements. 

The separation of signal from consequent action, exemplified 
by RNAi and CRISPR, and writ large by enhancers 
and other types of lncRNAs, is a highly, and likely the 
most, efficient and versatile means of gene regulation. The 
advent of these advanced RNA-based regulatory systems 
permitted the emergence of developmentally complex 
organisms, following which evolution experimented with 
more sophisticated designs to colonize new niches, leading 
to the extraordinary biodiversity that we see today, embel- 
lished by sexual (mate) selection as proposed by Darwin.* 
Most differences between species and individuals are 
embedded in variations in their regulatory architecture, 
the extent of which expands with developmental com- 
plexity, like increasingly elaborate building plans using a 
relatively generic set of component parts, albeit with occa- 
sional important innovations, such as the immunoglobulin 
domain, Arc RNA transfer and RNA editing proteins. 

Meanwhile, behind the scenes, while all organisms 
benefit from information processing, animals were climb- 
ing the next mountain, cognition, by superimposing plas- 
ticity on hardwired genomic information (Figure 18.1), 
with selection strength dependent on mobility and 
boosted by dexterity.^? It may be no accident, contrary 
to Stephen Jay Gould's proposition of contingent history"? 
that the most cognitively advanced vertebrates and inver- 
tebrates are the primates and cephalopods. 

2 This is not to say that every sequence is functional. There will, of 
course, be recent duplications and transpositions that have not yet 
been subject to evolutionary selection. 


It is increasingly clear that, rather than a simple intermediate 
between gene and protein, RNA is the computational 
engine of the cell, development, cognition and 
evolution." The challenge is not only to understand the 
principles of how RNA interacts with effector proteins in 
the formation and function of dedicated cellular domains 
to orchestrate cell division or differentiation decisions, but 
to decipher the code itself. This will be a bit like climbing 
inside a computer to try to work out how it can produce 
and project three-dimensional images, except that 
the human genome does so while building itself and its 
internal computer, the brain, with far more precision and 
complexity, from an amazingly compact information 
suite that is roughly equivalent in size to that which can 
be held on a 1 GB thumb drive. 
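
The thumb drive comparison can be checked with a back-of-envelope calculation. The sketch below is illustrative only and rests on assumptions not stated in the text: a haploid human genome of roughly 3.1 billion base pairs, two bits per base (four possible bases), and no compression or annotation.

```python
# Back-of-envelope storage size of a genome held as plain sequence.
# Assumptions (not from the text): ~3.1e9 base pairs in the haploid
# human genome, 2 bits per base (A, C, G, T), no compression.

BITS_PER_BASE = 2  # four possible bases -> log2(4) = 2 bits
HAPLOID_BASE_PAIRS = 3_100_000_000  # approximate, assumed figure


def genome_size_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> float:
    """Return the uncompressed storage size in bytes."""
    return base_pairs * bits_per_base / 8


size_bytes = genome_size_bytes(HAPLOID_BASE_PAIRS)
size_mb = size_bytes / 1e6  # decimal megabytes
print(f"{size_mb:.0f} MB")  # prints "775 MB"
```

Under these assumptions the raw sequence occupies a little under one gigabyte, which is consistent with the comparison made above. A diploid genome, or one stored as text at one byte per base, would be correspondingly larger.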


THE EVOLUTION OF EVOLVABILITY 


The other long-standing assumptions have been that 
mutations are random, and that experience cannot be 
communicated to modulate the phenotype of subsequent 
generations, asserted in the formative years of evolution- 
ary biology and molecular genetics based on preconcep- 
tion rather than evidence. Both assumptions are clearly 
incorrect, with non-random mutation and epigenetic 
inheritance now well documented in both plants and ani- 
mals,!2-18 and evidence of a relationship.” 


This raises the question of whether there is interplay 
between genetic and epigenetic inheritance to accelerate 
evolutionary processes. It is obvious that evolution can- 
not have proceeded by random search alone — the number 
of variables is too great. This problem was recognized 
generically two decades ago by Rodney Downey and 
Michael Fellows, who pointed out that in large complex 
systems random searches become computationally intrac- 
table (“NP-hard”) because of the exponential increase in 
the possibilities.202! This problem must also apply to evo- 
lution if it operates, as described by Dennett, as a grinding 
algorithm of generate and test,? and becomes increasingly 
acute in organisms, notably birds and mammals, that have 
long generation times and small numbers of progeny.” 
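
The exponential growth in possibilities referred to above can be made concrete with a minimal sketch. It only counts distinct nucleotide sequences of a given length, and it makes no claim about how evolution actually searches. The budget of 10^40 trials is a purely hypothetical number chosen for illustration, and the function names are likewise illustrative.

```python
import math

ALPHABET_SIZE = 4  # A, C, G, T


def search_space_log10(length: int) -> float:
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(ALPHABET_SIZE)


def longest_exhaustible(trials_log10: float) -> int:
    """Longest sequence whose full space fits within 10**trials_log10 trials."""
    return math.floor(trials_log10 / math.log10(ALPHABET_SIZE))


# The search space grows by a factor of four with every added base.
for n in (10, 100, 1000):
    print(f"length {n}: about 10^{search_space_log10(n):.1f} sequences")

# Hypothetical budget of 10^40 trials (an assumption, not a measurement).
print(longest_exhaustible(40))  # prints 66
```

Even with this very generous hypothetical budget, exhaustive sampling covers sequences of only 66 bases, far shorter than a typical gene, which is the sense in which an unguided random search quickly becomes intractable.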


^ The numbers of variation options in such organisms per generation 
must be minuscule without compensatory mechanisms, such as the 
enigmatic double round of piRNA expression in sperm development, 
posited to allow controlled transposon mobilization and subsequent 
siRNA-mediated transcriptional proofing (which most sperm fail), to 
generate viable options for evolutionary selection in small popula- 
tions with long generation times.???^ There are high primary rates of 
retrotransposition in mammals.? 
Downey and Fellows's proposed solution to the problem 
is to define the most productive subspace and optimal 
tactics to decrease the complexity of the search and 
increase the chances of productive outcomes, termed 
‘Parameterized Complexity’.???! Logically, in the case 
of evolution, any (initially random) event that enhanced 
the evolvability of the lineage concerned must have been 
subject to second-order selection on the basis of its stra- 
tegic advantage. By extension, any lineage that stumbled 
into such strategic advantage would come to dominate the 
evolutionary landscape and, by definition, be part of the 
toolkit of most, if not all, extant lineages in the biosphere. 

The evidence to support this logic is fragmentary, and 
the topic of the evolution of evolvability has been subject 
to considerable speculation and debate.?* Evolutionary 
computer science has shown that random mutation, 
recombination and selection are not universally effec- 
tive in improving complex systems and that for adapta- 
tion to occur, these systems must acquire evolvability.2728 
Moreover, the modeling suggests that simple genotype–phenotype 
mapping is suboptimal, whereas the use of 
indirect developmental representations allows the reuse 
of code (modularity), and scaling up of the complexity of 
artificially evolved phenotypes, for example, in robotics, 
artificial life and morphogenetic engineering.” Indeed 
one important enabler of biological evolution is modular- 
ity,2830-% itself an evolved characteristic,” classically and 
graphically exemplified by simple homeotic mutations 
that convert insects from having two wings to four.** 

Different organisms have tuned their innate muta- 
tion frequencies to optimize the trade-off between sur- 
vival and evolvability, 45556 but this is only a blunderbuss 
approach. However, mutation frequency and transposon 
distribution vary across the genome and over time,!537 
as judged by indices of neutral evolution, although 
these indices are unreliable and the extant distribution 
is difficult to disentangle from selection and differential 
repair and recombination.!*38-42 Nonetheless, such varia- 
tion occurs in the extensive non-coding regions of the 
genomes of complex organisms, evidence of selection at 
one level or another. 

Rapid evolution has been documented in many spe- 
cies. It is, for example, observed in the human lineage,? 
correlating with bursts of Alu element invasion, reflecting 
the huge positive selection value of cognitive advance- 
ment in intra- and inter-species competition.64^ 

The interplay between genetic and epigenetic inheri- 
tance changes the dynamics of natural selection, as 
argued by Eva Jablonka and others.^^^? It also changes 
the interplay between genes and environment, and the 
dichotomy of ‘Nature versus Nurture’ assessments, 
with profound evolutionary and social implications. 
Epigenome-associated mutation bias reduces the occur- 
rence of deleterious mutations in essential genes in 
Arabidopsis? and recent evidence indicates that mutation 
sites in sperm are non-random, associated with adapta- 
tion.?! There is also evidence that epigenetic changes can 
increase drug resistance in cancer cells until a genetic 
solution is found.*? 

If epigenetic information can provide transgenera- 
tional phenotypic advantage, it is not a long stretch to 
suggest that evolution may have found ways to convert 
this information into hardwired genetic changes to speed 
adaptive change. In fact, it would be foolish to reject 
the possibility out of hand. The infrastructure (RNA-mediated 
epigenetic inheritance and RNA-templated 
DNA repair) is in place. Has evolution learned how to 
learn? 

Genomes contain biological software encompassing 
codes for components, self-assembly, differentiation and 
reproduction, supplemented by information in parental 
cells and epigenetic memories. Not only has the data 
evolved, but also the data structures, implementation 
systems and search algorithms. We have some way to go 
to understand the complexity and beauty of genetic pro- 
gramming, but the best places to start are to accept that 
RNA plays a major role in the evolution and mechanics 
of developmental control and cognitive processes, and to 
keep an open and receptive mind, especially when we are 
once again surprised. 


“A new scientific truth does not triumph by convincing 
its opponents and making them see the light, but rather 
because its opponents eventually die, and a new generation 
grows up that is familiar with it.” 


(Max Planck?) 